<video width="420" controls>
<source src="mov_bbb.mp4" type="video/mp4">
<source src="mov_bbb.ogg" type="video/ogg">
Your browser does not support the video tag.</video>
ENV222 NOTE
R in statistics (Advanced)
Hyperlinks to the XJTLU ENV222 courseware
- Lecture-1 ENV222 Overview
- Lecture-2 R Markdown basic
- Lecture-3 R Markdown advanced
- Lecture-4 R for characters
- Lecture-5 R for time data
- Lecture-6 Statistical graphs (advanced)
- Lecture-7 R Tidyverse
- Lecture-8 ANOVA Post-hoc tests
- Lecture-9 MANOVA
- Lecture-10 ANCOVA
- Lecture-11 MANCOVA
- Lecture-12 Combining statistics
- Lecture-13 Non-parametric hypothesis tests
- Lecture-14 Multiple linear regression
- Lecture-15 Logistic regression
- Lecture-16 Poisson regression
- Lecture-17 Non-linear regression
- Lecture-18 Tests for distributions
- Lecture-19 Principal component analysis
- Lecture-20 Cluster analysis
The meaning of each parameter in a statistical table
- Df (degrees of freedom): the total of the regression (between-group) degrees of freedom and the error degrees of freedom; the regression df is the number of groups minus 1, and the error df is the sample size minus the number of groups.
- Sum Sq (sum of squares): the sum of squares for each source, i.e. the sum of squared deviations from the mean. The regression sum of squares measures how much of the dependent variable the independent variable accounts for; the error sum of squares is the part the independent variable fails to explain.
- Mean Sq (mean square): the mean square for each source, i.e. the sum of squares divided by its degrees of freedom. The regression mean square reflects the effect of each independent variable on the dependent variable; the error mean square is the average unexplained variance.
- F value: the ratio of the regression mean square to the error mean square, used to judge the fit of the model; the larger the F value, the better the model. In simple linear regression, the F value equals the square of the t value (see the sketch after this list).
- Pr(>F): the probability of obtaining a larger F value; this is the p value of the F test. The smaller the p value, the more significant the result. The significance level is usually set to 0.05, i.e. a result is considered statistically significant when p < 0.05.
- Pillai: Pillai's trace, a statistic used in multivariate analysis of variance to measure the joint influence of the factors on the dependent variables.
- approx F: the F approximation computed from Pillai's trace, used to assess the overall significance of the MANOVA.
- num Df: numerator degrees of freedom.
- den Df: denominator degrees of freedom.
- Residuals: the error term in the MANOVA, i.e. the variance of the dependent variables that the independent variables cannot explain.
- Intercept: also called the constant term; the predicted (expected) value of the dependent variable when the independent variables are 0.
- Estimate: the regression coefficient; the average change in the dependent variable when the independent variable increases by one unit.
- Std.Error: the standard error, which expresses the uncertainty of an estimate, i.e. the average difference between the estimate and the true value.
- t value: indicates the significance of a regression coefficient; it is the coefficient divided by its standard error, which is compared against the t distribution.
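The relationship between Mean Sq, F value, and Pr(>F) can be verified by hand; below is a minimal sketch, assuming a one-way ANOVA on R's built-in PlantGrowth data (any ANOVA table works the same way):
# Recompute F and Pr(>F) from the Mean Sq column by hand
a <- summary(aov(weight ~ group, data = PlantGrowth))[[1]]
f <- a[["Mean Sq"]][1] / a[["Mean Sq"]][2]                  # F = regression MS / error MS
p <- pf(f, a[["Df"]][1], a[["Df"]][2], lower.tail = FALSE)  # Pr(>F)
c(F = f, p = p)  # matches the F value and Pr(>F) printed by summary()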
How to choose between ANOVA, MANOVA, ANCOVA and MANCOVA
ANOVA, MANOVA, ANCOVA and MANCOVA are common statistical methods, mainly used to compare differences between two or more groups and to draw statistically supported inferences about those differences (see the sketch after this list).
- ANOVA (Analysis of Variance): compares whether the means of two or more groups differ significantly; suitable when there is one independent variable and one dependent variable. For example, comparing whether different teaching methods have a significant effect on student grades.
- MANOVA (Multivariate Analysis of Variance): compares whether two or more groups differ significantly on several related dependent variables; suitable when there are multiple related dependent variables. For example, comparing whether different teaching methods significantly affect student grades, attitudes, and motivation together.
- ANCOVA (Analysis of Covariance): compares whether the means of two or more groups differ significantly while controlling for one or more covariates (confounding factors); suitable when other influences must be controlled. For example, comparing the effect of teaching methods on grades while controlling for students' initial level, so that differences in initial level do not distort the comparison.
- MANCOVA (Multivariate Analysis of Covariance): compares two or more groups on several related dependent variables while simultaneously controlling for multiple covariates; suitable when there are multiple related dependent variables and multiple covariates. For example, comparing the effect of teaching methods on grades, attitudes, and motivation while controlling for students' initial level, sex, age, and so on.
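The four choices correspond to four model calls in R. A minimal sketch on a small simulated data frame (the data frame d and its columns are hypothetical, made up only to show the formula patterns):
# Hypothetical data: a two-level factor, a covariate x, and two numeric responses y1, y2
set.seed(1)
d <- data.frame(group = gl(2, 10), x = rnorm(20), y1 = rnorm(20), y2 = rnorm(20))
summary(aov(y1 ~ group, data = d))                    # ANOVA: one DV
summary(manova(cbind(y1, y2) ~ group, data = d))      # MANOVA: multiple DVs
summary(aov(y1 ~ group + x, data = d))                # ANCOVA: one DV plus a covariate
summary(manova(cbind(y1, y2) ~ group + x, data = d))  # MANCOVA: multiple DVs plus a covariate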
👉🏻Click to enter the ENV222 exercise section
👉🏻Click to enter the ENV221 note section
1 R-markdown(qmd, html) syntax
1.1 Fundamental
- Subscript in Rmarkdown: use PM~2.5~ to form PM2.5.
- Subscript in HTML: log<sub>2</sub> will be displayed as log2.
- Superscript in Rmarkdown: use R^2^ to form R2.
- Superscript in HTML: 2<sup>n</sup> will be displayed as 2n.
- Use $E = mc^2$ to form \(E = mc^2\).
- Use [Link of XJTLU](http://xjtlu.edu.cn) to form a hyperlink: Link of XJTLU.
- Use <mark>text</mark> to highlight the text.
- Use <u>text</u> to underline the text.
- Use <center><img src="images/rstudio-qmd-how-it-works.png" width="1400" height="257"/> or <center> ![rstudio qmd how it works](images/rstudio-qmd-how-it-works.png){width=100%} to insert an image.
- Use <iframe src="https://www.r-project.org/" width="100%" height="300px"></iframe> to embed a window that shows another page online.
- Use the following HTML code to add a video to your Rmarkdown (HTML):
<video width="420" controls>
<source src="mov_bbb.mp4" type="video/mp4">
<source src="mov_bbb.ogg" type="video/ogg">
Your browser does not support the video tag.
</video>
- Use something like {r, fig.width = 6, fig.height = 4, fig.align='center'} in the header of a code chunk to change the output graphics.
- Use {r, XXX-Plot, fig.cap="XXX-Plot"} in the header of a code chunk to add a caption to the figure.
- Use something like <span style="color:red; font-weight:bold; font-size:16px; font-family:Tahoma;">sentence</span> to change the properties of the text.
- When we need to report statistical analyses and formulas in Rmarkdown, we can use the code templates below (fill data in):
# Equation
equatiomatic::extract_eq(data, use_coefs = TRUE)
# Analysis
stargazer::stargazer(data, type = 'text')
- Use the following HTML code to add a foldable item to your Rmarkdown (HTML):
<details>
<summary>title</summary>
content
</details>
- Use
| Name | Math | English |
|:----:|:-----|--------:|
| Tom | 93 | 100 |
| Mary | 60 | 90 |
to form
| Name | Math | English |
|------|------|---------|
| Tom  | 93   | 100     |
| Mary | 60   | 90      |
- Or use
library(knitr)
df <- data.frame(
  Math = c(80, 90),
  English = c(85, 95),
  row.names = c("Tom", "Mary")
)
kable(df, format = "markdown")
|      | Math | English |
|------|------|---------|
| Tom  | 80   | 85      |
| Mary | 90   | 95      |
- Use
1.
2.
   1.
   2.
   3.
to form a nested (sub-ranked) numbered list.
1.2 Advanced
- Numbering, captioning, and cross-referencing R plots in an academic paper: click to see the details in ENV222 Week 5, Section 5.2
2 Basic R characters
2.1 Check the data type
# import dataset
x <- 'The world is on the verge of being flipped topsy-turvy.'
dtf <- read.csv('data/student_names.csv')
head(dtf)
Name Prgrm
1 Yuzhou Liu Bio
2 Penghui Wang Bio
3 Ziang Li Eco
4 Youcheng Jin Bio
5 Yuge Yang Bio
6 Yutian Song Eco
# data type
class(x)
[1] "character"
# length of the vector
length(x)
[1] 1
# number of characters in the string
nchar(x)
[1] 55
2.2 Index to maximum and minimum values
- Find the longest name in student_names.csv. which.max() or which.min() is used to find the index of the (first) maximum or minimum of a numeric (or logical) vector.
name_n <- nchar(dtf$Name)
name_nmax <- which.max(name_n)
dtf$Name[name_nmax]
[1] "Iyan Hafiz Bin Mohamed Faizel"
# or
dtf$Name[which.max(nchar(dtf$Name))]
[1] "Iyan Hafiz Bin Mohamed Faizel"
# or
library(magrittr)
dtf$Name %>% nchar() %>% which.max() %>% dtf$Name[.]
[1] "Iyan Hafiz Bin Mohamed Faizel"
2.3 Uppercase and lowercase letters
# tolower() and toupper()
(xupper <- toupper(x))
[1] "THE WORLD IS ON THE VERGE OF BEING FLIPPED TOPSY-TURVY."
(dtf$pro <- tolower(dtf$Prgrm))
[1] "bio" "bio" "eco" "bio" "bio" "eco" "bio" "bio" "eco" "bio" "eco" "bio"
[13] "bio" "env" "bio" "bio" "bio" "env" "bio" "env" "bio" "bio" "bio" "bio"
[25] "eco" "eco" "env" "bio" "bio" "bio" "bio" "env" "bio" "bio" "bio" "bio"
[37] "bio" "eco" "eco" "bio" "bio" "bio" "bio" "bio" "bio" "bio" "bio" "eco"
[49] "bio" "bio" "env" "eco" "eco" "bio" "env" "env" "env"
2.4 Split string
# strsplit()
x_word <- strsplit(xupper, ' ')
class(x_word)
[1] "list"
# Use double brackets [[ ]] to extract the first element of the list, and single brackets [ ] to extract the first sublist
x_word1 <- x_word[[1]]
class(x_word1)
[1] "character"
table(x_word1) # Form a table of the frequency of each word
x_word1
BEING FLIPPED IS OF ON THE
1 1 1 1 1 2
TOPSY-TURVY. VERGE WORLD
1 1 1
x_word1[!duplicated(x_word1)] # Find the distinct words by using the duplicated() function
[1] "THE" "WORLD" "IS" "ON" "VERGE"
[6] "OF" "BEING" "FLIPPED" "TOPSY-TURVY."
unique(x_word1) # Another way to find the distinct words
[1] "THE" "WORLD" "IS" "ON" "VERGE"
[6] "OF" "BEING" "FLIPPED" "TOPSY-TURVY."
lapply(x_word, length) # The output of the lapply() function is a list
[[1]]
[1] 10
sapply(x_word, length) # The output of the sapply() function is a vector or a matrix
[1] 10
sapply(x_word, nchar)
[,1]
[1,] 3
[2,] 5
[3,] 2
[4,] 2
[5,] 3
[6,] 5
[7,] 2
[8,] 5
[9,] 7
[10,] 12
2.5 Separate column
# separate() is a function in the tidyr package that splits one column of a data frame into multiple columns
library(tidyr)
dtf2 <- separate(dtf, Name, c("GivenName", "LastName"), sep = ' ') # separate(data, col, into, sep)
dtf$FamilyName <- dtf2$LastName
head(dtf)
Name Prgrm pro FamilyName
1 Yuzhou Liu Bio bio Liu
2 Penghui Wang Bio bio Wang
3 Ziang Li Eco eco Li
4 Youcheng Jin Bio bio Jin
5 Yuge Yang Bio bio Yang
6 Yutian Song Eco eco Song
2.6 Extract
The world is on the verge of being flipped topsy-turvy.
# substr() is a built-in function in R that can be used to extract or replace substrings from a character vector
substr(x, 13, 15) # substr(x, start, stop)
[1] " on"
dtf$NameAbb <- substr(dtf$Name, 1, 1)
head(dtf, 3)
Name Prgrm pro FamilyName NameAbb
1 Yuzhou Liu Bio bio Liu Y
2 Penghui Wang Bio bio Wang P
3 Ziang Li Eco eco Li Z
2.7 Connect
# paste() function can convert multiple objects into character vectors and concatenate them
paste(x, '<end>', sep = ' ') # paste(x1, x2,... sep, collapse)
[1] "The world is on the verge of being flipped topsy-turvy. <end>"
paste(dtf$NameAbb, '.', sep = '')
[1] "Y." "P." "Z." "Y." "Y." "Y." "H." "A." "E." "J." "Y." "M." "Y." "Y." "X."
[16] "Z." "W." "J." "Q." "Y." "F." "Y." "Z." "M." "Z." "S." "X." "J." "R." "Z."
[31] "Y." "X." "Z." "Y." "Q." "Y." "Y." "M." "Q." "J." "Y." "X." "M." "Q." "Y."
[46] "P." "Y." "J." "L." "Y." "Q." "H." "Z." "H." "J." "Y." "I."
paste(dtf$NameAbb, collapse = ' ') # collapse = ' ' joins all elements into one string
[1] "Y P Z Y Y Y H A E J Y M Y Y X Z W J Q Y F Y Z M Z S X J R Z Y X Z Y Q Y Y M Q J Y X M Q Y P Y J L Y Q H Z H J Y I"
paste(dtf$NameAbb, dtf$FamilyName, sep = '. ')[7] # This is my name in the format used for academic citations
[1] "H. Zhu"
2.8 Find
# grep() is a built-in R function that searches for a pattern match in each element of a character vector
y <- c("R", "Python", "Java")
grep("Java", y)
[1] 3
for(i in 1:length(y)) {
print(grep(as.character(y[i]), y))
}
[1] 1
[1] 2
[1] 3
sapply(y, function(x) grep(x, y))
R Python Java
1 2 3
head(table(dtf2$GivenName), 12)
Adriel Elina Fengyi Haode Hongli Huangtianchi
1 1 1 1 1 1
Iyan Jiajie Jiawei Jiayi Jingyun Jiumei
1 1 1 2 1 1
grep('Jiayi', dtf$Name, value = TRUE)
[1] "Jiayi Chen" "Jiayi Guo"
grep('Jiayi|Guo', dtf$Name, value = TRUE)
[1] "Jiayi Chen" "Fengyi Guo" "Jiayi Guo"
# regexpr() identifies the position of the pattern in a character vector, searching each element separately
z <- c("R is fun", "R is cool", "R is awesome")
regexpr("is", z) # Returns the starting position, match length, index type, ...
[1] 3 3 3
attr(,"match.length")
[1] 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
gregexpr("is", z) # The gregexpr() function returns all matching positions and lengths, as a list
[[1]]
[1] 3
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 3
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[3]]
[1] 3
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
2.9 Replace
# gsub()
gsub(' ', '-', x)
[1] "The-world-is-on-the-verge-of-being-flipped-topsy-turvy."
2.10 Regular expression
# help(regex)
# Find the one who has a given name with 4 letters and a family name with 4 letters
grep('^[[:alpha:]]{4} [[:alpha:]]{4}$', dtf$Name, value = TRUE)
[1] "Yuge Yang" "Ziyu Yuan" "Qian Chen" "Ying Zhou"
# Here, parentheses create a capturing group: a subexpression whose matched text is captured and stored during matching.
# In this example, the capturing group extracts the first word from the string; \\1 in the replacement refers back to it. Without the capturing group, the entire matched string would be replaced, not just the first word.
dtf$FirstName <- gsub('^([^ ]+).+[^ ]+$', '\\1', dtf$Name)
head(dtf)
Name Prgrm pro FamilyName NameAbb FirstName
1 Yuzhou Liu Bio bio Liu Y Yuzhou
2 Penghui Wang Bio bio Wang P Penghui
3 Ziang Li Eco eco Li Z Ziang
4 Youcheng Jin Bio bio Jin Y Youcheng
5 Yuge Yang Bio bio Yang Y Yuge
6 Yutian Song Eco eco Song Y Yutian
The basic regular-expression syntax in R Markdown is as follows:
- . matches any single character except a newline.
- [ ] matches any one of the characters inside the brackets; e.g. [abc] matches a, b, or c.
- [^ ] matches any one character not inside the brackets; e.g. [^abc] matches anything except a, b, or c.
- - inside square brackets denotes a range; e.g. [a-z] matches lowercase letters and [0-9] matches digits.
- \d, \D, \w, \W, \s, \S match a digit, a non-digit, a word character (letter, digit, or underscore), a non-word character, a whitespace character (space, tab, or newline), and a non-whitespace character, respectively.
- \b, \B, ^, $, \ match a word boundary (between a word and a non-word character), a non-word boundary (between two word or two non-word characters), the start of the string, the end of the string, and the escape character (used to match a metacharacter literally), respectively.
- ( ), |, ?, +, *, { } denote a group or capturing subexpression (which can be referenced back with a backslash plus a digit), alternation (match the left or the right side), zero or one repetition, one or more repetitions, zero or more repetitions, and a specified number of repetitions, respectively.
A simple example: finding a Markdown link such as [This is a link](https://www.example.com):
\[([^\]]+)\]\(([^)]+)\)
This regular expression breaks down as follows:
- \[ matches the left square bracket
- ([^\]]+) matches and captures one or more characters that are not a right square bracket
- \] matches the right square bracket
- \( matches the left parenthesis
- ([^)]+) matches and captures one or more characters that are not a right parenthesis
- \) matches the right parenthesis
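A minimal sketch applying this pattern in R (the sample string md is made up for illustration); note that every backslash must be doubled inside an R string:
md <- "See [This is a link](https://www.example.com) for details."
pat <- "\\[([^]]+)\\]\\(([^)]+)\\)"
(m <- regmatches(md, gregexpr(pat, md))[[1]])  # the whole matched link
gsub(pat, "\\1", m)                            # capture group 1: the link text
gsub(pat, "\\2", m)                            # capture group 2: the URL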
3 Time data in R
3.1 Format of time
# Check the current date
date()
[1] "Mon May 22 15:43:52 2023"
# character
d1 <- "2/11/1962"
# Date/Time format: we can directly use e.g. "d2 + 1" to add 1 day to d2
d2 <- Sys.Date()
t2 <- Sys.time()
# Check their types
t(list(class(d1), class(d2), class(t2)))
[,1] [,2] [,3]
[1,] "character" "Date" character,2
3.2 Numeric values of dates
# Use format = "" to tell R how to parse the character string into a date
d3 <- as.Date("2/11/1962", format = "%d/%m/%Y")
as.numeric(d3)
[1] -2617
d3 + 2617
[1] "1970-01-01"
format(d3, '%Y %m %d')
[1] "1962 11 02"
format(d3, "%Y %B %d %A")
[1] "1962 November 02 Friday"
# A different format gives a different meaning
d4 <- as.Date("2/11/1962", format = "%m/%d/%Y")
d3 == d4
[1] FALSE
3.2.1 Time format codes
- %Y: four-digit year
- %y: two-digit year
- %m: two-digit month (01~12)
- %d: two-digit day of the month (01~31)
- %H: hour in 24-hour format (00~23)
- %M: two-digit minute (00~59)
- %S: two-digit second (00~59)
- %z: time-zone offset, for example +0800
- %Z: time-zone name, for example CST
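A quick sketch trying a few of these codes on the current time (the exact output depends on your locale and time zone):
format(Sys.time(), "%Y-%m-%d %H:%M:%S %z")  # e.g. "2023-05-22 15:43:52 +0800"
format(Sys.Date(), "%d/%m/%y")              # day/month/two-digit year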
3.3 Calculating with dates
# import built-in data diet (The data concern a subsample of subjects drawn from larger cohort studies of the incidence of coronary heart disease (CHD))
library('Epi')
data("diet")
str(diet)
'data.frame': 337 obs. of 15 variables:
$ id : num 102 59 126 16 247 272 268 206 182 2 ...
$ doe : Date, format: "1976-01-17" "1973-07-16" ...
$ dox : Date, format: "1986-12-02" "1982-07-05" ...
$ dob : Date, format: "1939-03-02" "1912-07-05" ...
$ y : num 10.875 8.969 14.01 0.627 11.274 ...
$ fail : num 0 0 13 3 13 3 0 0 13 0 ...
$ job : Factor w/ 3 levels "Driver","Conductor",..: 1 1 2 1 3 3 3 3 2 1 ...
$ month : num 1 7 3 5 3 3 2 1 3 12 ...
$ energy : num 22.9 23.9 25 22.2 18.5 ...
$ height : num 182 166 152 171 178 ...
$ weight : num 88.2 58.7 49.9 89.4 97.1 ...
$ fat : num 9.17 9.65 11.25 7.58 9.15 ...
$ fibre : num 1.4 0.935 1.248 1.557 0.991 ...
$ energy.grp: Factor w/ 2 levels "<=2750 KCals",..: 1 1 1 1 1 1 1 1 1 1 ...
$ chd : num 0 0 1 1 1 1 0 0 1 0 ...
# Prepare the data we will work with
bdat <- diet$dox[1]
bdat
[1] "1986-12-02"
# Some basic calculations between dates
bdat + 1
[1] "1986-12-03"
diet$dox2 <- format(diet$dox, format = "%A %d %B %Y")
head(diet$dox2, 3)
[1] "Tuesday 02 December 1986" "Monday 05 July 1982"
[3] "Tuesday 20 March 1984"
# Some advanced calculation between dates
max(diet$dox)
[1] "1986-12-02"
range(diet$dox)
[1] "1968-08-29" "1986-12-02"
mean(diet$dox)
[1] "1984-02-20"
median(diet$dox)
[1] "1986-12-02"
diff(range(diet$dox))
Time difference of 6669 days
difftime(min(diet$dox), max(diet$dox), units = "weeks") # Set unit
Time difference of -952.7143 weeks
# Epi::cal.yr() converts dates to a numeric (calendar-year) format
diet2 <- Epi::cal.yr(diet)
str(diet2)
'data.frame': 337 obs. of 16 variables:
$ id : num 102 59 126 16 247 272 268 206 182 2 ...
$ doe : 'cal.yr' num 1976 1974 1970 1969 1968 ...
$ dox : 'cal.yr' num 1987 1983 1984 1970 1979 ...
$ dob : 'cal.yr' num 1939 1913 1920 1907 1919 ...
$ y : num 10.875 8.969 14.01 0.627 11.274 ...
$ fail : num 0 0 13 3 13 3 0 0 13 0 ...
$ job : Factor w/ 3 levels "Driver","Conductor",..: 1 1 2 1 3 3 3 3 2 1 ...
$ month : num 1 7 3 5 3 3 2 1 3 12 ...
$ energy : num 22.9 23.9 25 22.2 18.5 ...
$ height : num 182 166 152 171 178 ...
$ weight : num 88.2 58.7 49.9 89.4 97.1 ...
$ fat : num 9.17 9.65 11.25 7.58 9.15 ...
$ fibre : num 1.4 0.935 1.248 1.557 0.991 ...
$ energy.grp: Factor w/ 2 levels "<=2750 KCals",..: 1 1 1 1 1 1 1 1 1 1 ...
$ chd : num 0 0 1 1 1 1 0 0 1 0 ...
$ dox2 : chr "Tuesday 02 December 1986" "Monday 05 July 1982" "Tuesday 20 March 1984" "Wednesday 31 December 1969" ...
3.4 Set time zone & calculation
bd <- '1994-09-22 20:30:00'
class(bd)
[1] "character"
bdtime <- strptime(x = bd, format = '%Y-%m-%d %H:%M:%S', tz = "Asia/Shanghai") # Convert the character string to a time object and set a time zone
class(bdtime)
[1] "POSIXlt" "POSIXt"
t(unclass(bdtime))
sec min hour mday mon year wday yday isdst zone gmtoff
[1,] 0 30 20 22 8 94 4 264 0 "CST" NA
attr(,"tzone")
[1] "Asia/Shanghai" "CST" "CDT"
bdtime$wday
[1] 4
format(bdtime, format = '%d.%m.%Y')
[1] "22.09.1994"
bdtime + 1
[1] "1994-09-22 20:30:01 CST"
# Also, some basic calculations
bd2 <- '1995-09-01 7:30:00'
bdtime2 <- strptime(bd2, format = '%Y-%m-%d %H:%M:%S', tz = 'Asia/Shanghai')
bdtime2 - bdtime
Time difference of 343.4583 days
difftime(time1 = bdtime2, time2 = bdtime, units = 'secs') # Set unit
Time difference of 29674800 secs
mean(c(bdtime, bdtime2))
[1] "1995-03-13 14:00:00 CST"
4 LaTeX
4.1 Fundamental
- Use
$$e^{i\pi}+1=0$$
to form Euler’s identity \[e^{i\pi}+1=0\]
- Hyperlink to a Chinese website with more details about LaTeX
4.2 Advanced
- Here are some additional formulas from ENV221 statistic method:
- Z-test:
The LaTeX expression for the Z-test is: \[Z=\frac{\overline{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\] where \(\overline{x}\) is the sample mean, \(\mu\) is the population mean, \(\sigma\) is the population standard deviation, and \(n\) is the sample size. - t-test:
The LaTeX expression for the t-test is: \[t=\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}\] where \(\overline{x}\) is the sample mean, \(\mu\) is the population mean, \(s\) is the sample standard deviation, and \(n\) is the sample size. - F-test:
The LaTeX expression for the F-test is: \[F=\frac{s_1^2}{s_2^2}\] where \(s_1^2\) is the variance of the first sample and \(s_2^2\) is the variance of the second sample. - Chi-square test:
The LaTeX expression for the chi-square test is: \[\chi^2=\sum_{i=1}^{n}\frac{(O_i-E_i)^2}{E_i}\] where \(O_i\) represents observed values and \(E_i\) represents expected values.
5 R graph (advanced)
5.1 Different theme of plot
library(ggplot2)
bw <- ggplot(CO2) + geom_point(aes(conc, uptake)) + theme_bw()
test <- ggplot(CO2) + geom_point(aes(conc, uptake)) + theme_test()
classic <- ggplot(CO2) + geom_point(aes(conc, uptake)) + theme_classic()
library(patchwork)
bw + test + classic +
  plot_layout(ncol = 3, widths = c(1, 1, 1), heights = c(1, 1, 1)) +
plot_annotation(
title = expression(CO[2] * " uptake by plant type plot with different theme"),
tag_levels = "A"
)
5.2 Math formulas with R
head(CO2)
Grouped Data: uptake ~ conc | Plant
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 95 16.0
2 Qn1 Quebec nonchilled 175 30.4
3 Qn1 Quebec nonchilled 250 34.8
4 Qn1 Quebec nonchilled 350 37.2
5 Qn1 Quebec nonchilled 500 35.3
6 Qn1 Quebec nonchilled 675 39.2
# fundamental expression
plot(CO2$conc, CO2$uptake, pch = 16, las = 1,
xlab = 'CO2 concentration', ylab = 'CO2 uptake')
# Advanced expression (Use `?plotmath` to check more details of mathematical annotation in R)
plot(CO2$conc, CO2$uptake, pch = 16, las = 1,
xlab = expression('CO'[2] * ' concentration (mL/L)'),
ylab = expression('CO'[2] * ' uptake (' *mu * 'mol m'^-2 * 's'^-1 * ')'))
# LaTeX expression
library(latex2exp)
plot(CO2$conc, CO2$uptake, pch = 16, las = 1,
xlab = TeX('CO$_2$ concentration (mL/L)'),
ylab = TeX('CO$_2$ uptake ($\\mu$mol m$^{-2}$ s$^{-1}$)'))
text(850, 30, expression(prod(plain(P)(X == x), x)))
5.3 Size and layout
- ggplot2: the patchwork package is used to arrange the size and layout of multiple plots
library(patchwork)
p1 <- ggplot(airquality) + geom_boxplot(aes(as.factor(Month), Ozone))
p2 <- ggplot(airquality) + geom_point(aes(Solar.R, Ozone))
p3 <- ggplot(airquality) + geom_histogram(aes(Ozone))
p1 + p2 + p3
p1 + p2 / p3
(p1 + p2) / p3
(p1 + p2) / p3 + plot_annotation(tag_levels = 'A') +
  plot_layout(ncol = 2, widths = c(1, 1), heights = c(1, 1)) # plot_layout() defines the grid layout of the composite figure
- Built-in par() function
par(mfrow = c(2, 3)) # Set the layout by using vector c(x, y)
plot(airquality$Solar.R, airquality$Ozone)
hist(airquality$Solar.R)
barplot(airquality$Month)
plot(airquality$Solar.R, airquality$Ozone)
hist(airquality$Solar.R)
barplot(airquality$Month)
- Built-in layout() function
# Use a matrix to store the layout information
mymat <- matrix(1:6, nrow = 2)
layout(mymat)
plot(airquality$Solar.R, airquality$Ozone)
hist(airquality$Solar.R)
barplot(airquality$Month)
plot(airquality$Solar.R, airquality$Ozone)
hist(airquality$Solar.R)
barplot(airquality$Month)
# Also, customize the exact layout with parameters such as widths= and heights=, filled with vectors
mymat <- matrix(c(1, 1:5), nrow = 2)
mymat # Check the matrix used to lay out the plots
[,1] [,2] [,3]
[1,] 1 2 4
[2,] 1 3 5
layout(mymat, widths = c(1, 1, 2), heights = c(1, 2))
plot(airquality$Solar.R, airquality$Ozone)
hist(airquality$Solar.R)
barplot(airquality$Month)
plot(airquality$Solar.R, airquality$Ozone)
hist(airquality$Solar.R)
# This is an example from quiz 1. Also, please check the exercises for more difficult questions
mymat <- matrix(c(1, 2, 3, 0), nrow = 2)
mymat # Check the matrix used to lay out the plots
[,1] [,2]
[1,] 1 3
[2,] 2 0
layout(mymat, widths = c(4, 1), heights = c(2, 1)) # Set the ratio between widths and heights
plot(iris$Sepal.Length, iris$Sepal.Width, pch=20, xlab='Sepal Length (cm)', ylab='Sepal Width (cm)', las=1)
boxplot(iris$Sepal.Length, pch=20, las=1, horizontal=T)
boxplot(iris$Sepal.Width, pch=20, las=2)
6 R Tidyverse
6.1 Workflow
6.2 Fundamental operations
# Load the package
library(tidyverse)
# Check the member packages
tidyverse_packages()
[1] "broom" "conflicted" "cli" "dbplyr"
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
Core members and their functions (a combined sketch follows this list):
- ggplot2: creating graphics
- dplyr: data manipulation
- tidyr: getting to tidy data
- readr: reading rectangular data
- purrr: functional programming
- tibble: a re-imagining of the data frame
- stringr: working with strings
- forcats: working with factors
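A minimal sketch that touches several core members in one pipeline (using the built-in iris data; the column name sp is made up here):
library(tidyverse)
as_tibble(iris) |>                                     # tibble: a modern data frame
  filter(Sepal.Length > 5) |>                          # dplyr: pick rows
  mutate(sp = str_to_upper(as.character(Species))) |>  # stringr: string manipulation
  count(sp)                                            # dplyr: count rows per group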
6.3 Pipe operator
The pipe operator can be written as %>% or |>.
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
# Method 1: step by step
y1 <- log(x)
y2 <- diff(y1)
y3 <- exp(y2)
z <- round(y3)
# Method 2: nested calls
z <- round(exp(diff(log(x))))
# Pipe method
z <- x %>% log() %>% diff() %>% exp() %>% round()
6.4 Multiple tasks using the tidyverse pipe
6.4.1 Graphical work
# By using R built-in par() function and a loop
par(mfrow = c(2, 2))
for (i in 1:4) {
boxplot(iris[, i] ~ iris$Species, las = 1, xlab = 'Species', ylab = names(iris)[i])
}
# By using the pivot_longer() function and the tidyverse pipe
iris |> pivot_longer(-Species) |> ggplot() + geom_boxplot(aes(Species, value)) + facet_wrap(name ~ .)
6.4.2 Statistical work
# base R
dtf1_mean <- data.frame(Species = unique(iris$Species), Mean_Sepal_Length = tapply(iris$Sepal.Length, iris$Species, mean, na.rm = TRUE))
dtf1_sd <- data.frame(Species = unique(iris$Species), SD_Sepal_Length = tapply(iris$Sepal.Length, iris$Species, sd, na.rm = TRUE))
dtf1_median <- data.frame(Species = unique(iris$Species), Median_Sepal_Length = tapply(iris$Sepal.Length, iris$Species, median, na.rm = TRUE))
names(dtf1_mean) <- c("Species", "Mean_Sepal_Length")
names(dtf1_sd) <- c("Species", "SD_Sepal_Length")
names(dtf1_median) <- c("Species", "Median_Sepal_Length")
cbind(dtf1_mean, dtf1_sd, dtf1_median) # Show them in one table
Species Mean_Sepal_Length Species SD_Sepal_Length Species
setosa setosa 5.006 setosa 0.3524897 setosa
versicolor versicolor 5.936 versicolor 0.5161711 versicolor
virginica virginica 6.588 virginica 0.6358796 virginica
Median_Sepal_Length
setosa 5.0
versicolor 5.9
virginica 6.5
# use a loop
dtf <- data.frame(rep(NA, 3))
for (i in 1:4) {
  dtf1_mean <- data.frame(tapply(iris[, i], iris$Species, mean, na.rm = TRUE))
  dtf1_sd <- data.frame(tapply(iris[, i], iris$Species, sd, na.rm = TRUE))
  dtf1_median <- data.frame(tapply(iris[, i], iris$Species, median, na.rm = TRUE))
  dtf1 <- cbind(dtf1_mean, dtf1_sd, dtf1_median)
  names(dtf1) <- paste0(names(iris)[i], '.', c('mean', 'sd', 'median'))
  dtf <- cbind(dtf, dtf1)
}
dtf
rep.NA..3. Sepal.Length.mean Sepal.Length.sd Sepal.Length.median
setosa NA 5.006 0.3524897 5.0
versicolor NA 5.936 0.5161711 5.9
virginica NA 6.588 0.6358796 6.5
Sepal.Width.mean Sepal.Width.sd Sepal.Width.median Petal.Length.mean
setosa 3.428 0.3790644 3.4 1.462
versicolor 2.770 0.3137983 2.8 4.260
virginica 2.974 0.3224966 3.0 5.552
Petal.Length.sd Petal.Length.median Petal.Width.mean Petal.Width.sd
setosa 0.1736640 1.50 0.246 0.1053856
versicolor 0.4699110 4.35 1.326 0.1977527
virginica 0.5518947 5.55 2.026 0.2746501
Petal.Width.median
setosa 0.2
versicolor 1.3
virginica 2.0
# tidyverse
dtf <- iris |>
  pivot_longer(-Species) |>
  group_by(Species, name) |>
  summarise(mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            .groups = "drop")
dtf
# A tibble: 12 × 5
Species name mean sd median
<fct> <chr> <dbl> <dbl> <dbl>
1 setosa Petal.Length 1.46 0.174 1.5
2 setosa Petal.Width 0.246 0.105 0.2
3 setosa Sepal.Length 5.01 0.352 5
4 setosa Sepal.Width 3.43 0.379 3.4
5 versicolor Petal.Length 4.26 0.470 4.35
6 versicolor Petal.Width 1.33 0.198 1.3
7 versicolor Sepal.Length 5.94 0.516 5.9
8 versicolor Sepal.Width 2.77 0.314 2.8
9 virginica Petal.Length 5.55 0.552 5.55
10 virginica Petal.Width 2.03 0.275 2
11 virginica Sepal.Length 6.59 0.636 6.5
12 virginica Sepal.Width 2.97 0.322 3
6.5 Tidy the dataset
# Original dataset of table1
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
# Compute rate per 10,000
table1 %>% mutate(rate = cases / population * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
# Compute cases per year
table1 %>% count(year, wt = cases)
# A tibble: 2 × 2
year n
<dbl> <dbl>
1 1999 250740
2 2000 296920
6.6 Converting the data frame layout
# Original dataset of table2
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# Spread the type column into cases and population columns
table2 %>% pivot_wider(names_from = type, values_from = count)
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
# Original dataset of table3
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
# Separate the rate into cases and population
table3 %>% separate(col = rate, into = c("cases", "population"), sep = "/")
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
# Original dataset of table4a and table4b
cbind(table4a, table4b)
country 1999 2000 country 1999 2000
1 Afghanistan 745 2666 Afghanistan 19987071 20595360
2 Brazil 37737 80488 Brazil 172006362 174504898
3 China 212258 213766 China 1272915272 1280428583
# Put table4a and table4b together to form a new table containing both datasets
tidy4a_changed <- table4a %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b_changed <- table4b %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a_changed, tidy4b_changed) # Works like a join in SQL
# A tibble: 6 × 4
country year cases population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
6.7 Find missing observations
library(openair)
library(tidyverse)
# create a function to count missing observations
sum_of_na <- function(x) {
  sum(is.na(x))
}
mydata %>% summarise(
  across(everything(), sum_of_na)
)
# A tibble: 1 × 10
date ws wd nox no2 o3 pm10 so2 co pm25
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 632 219 2423 2438 2589 2162 10450 1936 8775
7 ANOVA (Post-hoc tests)
7.1 Post-hoc tests
Background information: A biologist studies the weight gain of male lab rats on diets over a 4-week period. Three different diets are applied.
# Statistical analysis
(dtf <- data.frame(diet1 = c(90, 95, 100),
                   diet2 = c(120, 125, 130),
                   diet3 = c(125, 130, 135)))
diet1 diet2 diet3
1 90 120 125
2 95 125 130
3 100 130 135
dtf2 <- stack(dtf)
names(dtf2) <- c("wg", "diet")
wg_aov <- aov(wg ~ diet, data = dtf2)
summary(wg_aov)
Df Sum Sq Mean Sq F value Pr(>F)
diet 2 2150 1075 43 0.000277 ***
Residuals 6 150 25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Visualization
library(ggplot2)
ggplot(dtf2) + geom_boxplot(aes(wg, diet))
7.2 Fisher’s Least Significant Difference (LSD) Test
7.2.1 Concept
Pair-wise comparisons of all the groups based on the \(t\)-test: \[L S D=t_{\alpha / 2} \sqrt{S_{p}^{2}\left(\frac{1}{n_1}+\frac{1}{n_2}+\cdots\right)}\] \[S_{p}^{2}=\frac{\left(n_{1}-1\right) S_{1}^{2}+\left(n_{2}-1\right) S_{2}^{2}+\left(n_{3}-1\right) S_{3}^{2}+\cdots}{\left(n_{1}-1\right)+\left(n_{2}-1\right)+\left(n_{3}-1\right)+\cdots}\]
- \(S_{p}^{2}\): pooled variance (some use the mean squared error)
- \(t_{\alpha / 2}\): the \(t\) critical value at \(\alpha/2\) (i.e. 0.025 for \(\alpha = 0.05\))
- Degree of freedom: \(N - k\)
- \(N\): total observations
- \(k\): number of groups
- If \(\left|\bar{x}_{1}-\bar{x}_{2}\right|>L S D\), then the difference of \(x_1\) group and \(x_2\) group is significant at \(\alpha\).
- In multiple comparisons (\(k\) groups), the number of comparisons needed is \(\frac{k(k-1)}{2}\)
7.2.2 Example
(Rats on diets in the previous section)
- Step by step
# Calculate LSD
n <- nrow(dtf2)
k <- nlevels(dtf2$diet)
dfree <- n - k
t_critical <- qt(0.05/2, df = dfree, lower.tail = FALSE)
sp2 <- sum((3 - 1) * apply(dtf, 2, sd) ^ 2) / dfree
LSD <- t_critical * sqrt(sp2 * (1/3 + 1/3 + 1/3))
# Calculate |mean_x1 - mean_x2|
dtf_groupmean <- colMeans(dtf)
paired_groupmean <- combn(dtf_groupmean, 2)
paired_groupmean[2, ] - paired_groupmean[1, ]
[1] 30 35 5
LSD
[1] 12.23456
- Step by step using dplyr
library(dplyr)
dtf_sm <- dtf2 |>
  group_by(diet) |>
  summarise(n = length(wg), sd = sd(wg), mean = mean(wg))
sp2 <- sum((dtf_sm$n - 1) * dtf_sm$sd ^ 2) / dfree
LSD <- t_critical * sqrt(sp2 * sum(1 / dtf_sm$n))
paired_groupmean <- combn(dtf_sm$mean, 2)
paired_groupmean[2, ] - paired_groupmean[1, ]
[1] 30 35 5
- One step
library(agricolae)
# Statistic analysis
LSD.test(wg_aov, "diet", p.adj = "bonferroni") |> print()
$statistics
MSerror Df Mean CV t.value MSD
25 6 116.6667 4.285714 3.287455 13.42098
$parameters
test p.ajusted name.t ntr alpha
Fisher-LSD bonferroni diet 3 0.05
$means
wg std r LCL UCL Min Max Q25 Q50 Q75
diet1 95 5 3 87.93637 102.0636 90 100 92.5 95 97.5
diet2 125 5 3 117.93637 132.0636 120 130 122.5 125 127.5
diet3 130 5 3 122.93637 137.0636 125 135 127.5 130 132.5
$comparison
NULL
$groups
wg groups
diet3 130 a
diet2 125 a
diet1 95 b
attr(,"class")
[1] "group"
The meaning of each parameter in this table
- $statistics: a list of the statistics from the analysis.
- MSerror: the mean square error, also called the residual variance; the average size of the model error.
- Df: degrees of freedom.
- Mean: the overall mean.
- CV: the coefficient of variation; the larger the CV, the more dispersed the data.
- t.value: the t value, which reflects how significantly the group means differ.
- MSD: the minimum significant difference; the smallest difference between two groups that is significant at the chosen level.
- $parameters: a list of the parameters of the comparison.
- test: the multiple-comparison method used, here Fisher's LSD.
- p.ajusted: the adjustment applied to the significance level, here the Bonferroni method.
- name.t: the name of the factor being compared, here diet.
- ntr: the number of treatments (factor levels), here 3.
- alpha: the significance level, here 0.05.
- $means: a list of the group means and related statistics.
- wg: the group mean.
- std: the standard deviation.
- r: the number of replications.
- LCL: the lower confidence limit.
- UCL: the upper confidence limit.
- Min: the minimum value.
- Max: the maximum value.
- Q25: the 25th percentile.
- Q50: the 50th percentile, also called the median.
- Q75: the 75th percentile.
- $comparison: the pairwise comparison results, empty here.
- $groups: the grouping results.
- wg: the group mean.
- groups: the group labels; groups sharing a letter are not significantly different from each other.
- attr(,"class"): the class of the result, here "group".
# Visualization
LSD.test(wg_aov, "diet", p.adj = "bonferroni") |> plot()
box()
Conclusion: At \(\alpha = 0.05\), Diet 2 and Diet 3 are significantly different from Diet 1 in the mean weight gain, while Diet 2 is not significantly different from Diet 3.
7.3 Bonferroni t-test
7.3.1 Concept
A multiple-comparison post-hoc test, which sets the significance cutoff at \(\alpha/m\) for each comparison, where \(m\) is the number of comparisons we apply.
Overall chance of making a Type I error:
m <- 1:100
siglevel <- 0.05
Type_I <- 1 - (1 - (siglevel / m)) ^ m
Type_I
[1] 0.05000000 0.04937500 0.04917130 0.04907029 0.04900995 0.04896984
[7] 0.04894124 0.04891982 0.04890317 0.04888987 0.04887899 0.04886993
[13] 0.04886227 0.04885571 0.04885002 0.04884504 0.04884065 0.04883675
[19] 0.04883326 0.04883012 0.04882728 0.04882470 0.04882235 0.04882019
[25] 0.04881820 0.04881637 0.04881467 0.04881309 0.04881162 0.04881025
[31] 0.04880897 0.04880777 0.04880664 0.04880558 0.04880458 0.04880363
[37] 0.04880274 0.04880189 0.04880109 0.04880033 0.04879960 0.04879891
[43] 0.04879825 0.04879762 0.04879702 0.04879644 0.04879589 0.04879536
[49] 0.04879486 0.04879437 0.04879390 0.04879346 0.04879302 0.04879261
[55] 0.04879221 0.04879182 0.04879145 0.04879109 0.04879074 0.04879040
[61] 0.04879008 0.04878976 0.04878946 0.04878916 0.04878888 0.04878860
[67] 0.04878833 0.04878807 0.04878782 0.04878757 0.04878733 0.04878710
[73] 0.04878687 0.04878665 0.04878644 0.04878623 0.04878602 0.04878583
[79] 0.04878563 0.04878544 0.04878526 0.04878508 0.04878491 0.04878474
[85] 0.04878457 0.04878441 0.04878425 0.04878409 0.04878394 0.04878379
[91] 0.04878365 0.04878350 0.04878337 0.04878323 0.04878310 0.04878297
[97] 0.04878284 0.04878271 0.04878259 0.04878247
7.3.2 Example
(Rats on diets in the previous section)
- Step by step
m <- choose(nlevels(dtf2$diet), 2) # 1:2 or 1:3 or 2:3
alpha_cor <- 0.05 / m
# Pairwise comparison between diet1 and diet2
t.test(wg ~ diet, dtf2, subset = diet %in% c("diet1", "diet2"), conf.level = 1 - alpha_cor)
Welch Two Sample t-test
data: wg by diet
t = -7.3485, df = 4, p-value = 0.001826
alternative hypothesis: true difference in means between group diet1 and group diet2 is not equal to 0
98.33333 percent confidence interval:
-46.16984 -13.83016
sample estimates:
mean in group diet1 mean in group diet2
95 125
# Pairwise comparison between diet1 and diet3
t.test(wg ~ diet, dtf2, subset = diet %in% c("diet1", "diet3"), conf.level = 1 - alpha_cor)
Welch Two Sample t-test
data: wg by diet
t = -8.5732, df = 4, p-value = 0.001017
alternative hypothesis: true difference in means between group diet1 and group diet3 is not equal to 0
98.33333 percent confidence interval:
-51.16984 -18.83016
sample estimates:
mean in group diet1 mean in group diet3
95 130
# Pairwise comparison between diet2 and diet3
t.test(wg ~ diet, dtf2, subset = diet %in% c("diet2", "diet3"), conf.level = 1 - alpha_cor)
Welch Two Sample t-test
data: wg by diet
t = -1.2247, df = 4, p-value = 0.2879
alternative hypothesis: true difference in means between group diet2 and group diet3 is not equal to 0
98.33333 percent confidence interval:
-21.16984 11.16984
sample estimates:
mean in group diet2 mean in group diet3
125 130
- One step
(diet_pt <- pairwise.t.test(dtf2$wg, dtf2$diet, pool.sd = FALSE, var.equal = TRUE, p.adj = "none"))
Pairwise comparisons using t tests with non-pooled SD
data: dtf2$wg and dtf2$diet
diet1 diet2
diet2 0.0018 -
diet3 0.0010 0.2879
P value adjustment method: none
diet_pt$p.value < 0.05
diet1 diet2
diet2 TRUE NA
diet3 TRUE FALSE
Conclusion: At \(\alpha = 0.05\), Diet 2 and Diet 3 are significantly different from Diet 1 in the mean weight gain, while Diet 2 is not significantly different from Diet 3.
8 MANOVA
8.1 Definition
Univariate Analysis of Variance (ANOVA):
- one dependent variable (continuous) ~ one or multiple independent variables (categorical).
Multivariate Analysis of Variance (MANOVA)
- multiple dependent variables (continuous) ~ one or multiple independent variables (categorical).
Comparing multivariate sample means. It uses the covariance between outcome variables in testing the statistical significance of the mean differences when there are multiple dependent variables.
Merit of MANOVA:
- Reduce the Type I error
- It allows for the analysis of multiple dependent variables simultaneously
- It provides information about the strength and direction of relationships
8.2 Workflow
Example: Influence of teaching methods on student satisfaction scores and exam scores.
dtf <- read.csv('data/teaching_methods.csv')
head(dtf, 3)
Method Test Satisfaction
1 1 3.000 3.001
2 1 2.990 2.994
3 1 3.041 3.032
# ANOVA between Test and Method
summary(aov(Test ~ Method, data = dtf))
Df Sum Sq Mean Sq F value Pr(>F)
Method 1 0.000578 0.0005780 2.426 0.126
Residuals 46 0.010958 0.0002382
# ANOVA between Satisfaction and Method
summary(aov(Satisfaction ~ Method, data = dtf))
Df Sum Sq Mean Sq F value Pr(>F)
Method 1 0.000032 0.0000320 0.135 0.715
Residuals 46 0.010944 0.0002379
# Visualization with Scatter plot
library(ggplot2)
library(tidyr)
dtf |> pivot_longer(-Method) |>
  ggplot() +
geom_dotplot(aes(x = Method, y = value, group = Method), binaxis = "y", stackdir = "center") +
facet_wrap(name~.)
# Visualization with Box plot
par(mfrow = c(1, 3))
boxplot(Test ~ Method, data = dtf)
boxplot(Satisfaction ~ Method, data = dtf)
plot(dtf$Satisfaction, dtf$Test, col = dtf$Method, pch = 16, xlab = 'Satisfaction', ylab = 'Test')
# MANOVA method: use the manova() function with multiple response variables ~ one or more factors
# column-bind way
tm_manova <- manova(cbind(dtf$Test, dtf$Satisfaction) ~ dtf$Method)
# matrix way
tm_manova <- manova(as.matrix(dtf[, c('Test', 'Satisfaction')]) ~ dtf$Method)
summary(tm_manova)
Df Pillai approx F num Df den Df Pr(>F)
dtf$Method 1 0.45766 18.987 2 45 1.05e-06 ***
Residuals 46
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
8.3 One-way MANOVA
Example: The iris dataset. Do the species have influence on the sepal size?
# Visualization
library(ggplot2)
library(tidyr)
iris[, c('Species', 'Sepal.Length', 'Sepal.Width')] |>
  pivot_longer(cols = c(Sepal.Length, Sepal.Width)) |>
ggplot() +
geom_boxplot(aes(Species, value, fill = name)) +
labs(y = 'Size (cm)', fill = '')
library(gplots)
par(mfrow = c(1, 2))
plotmeans(iris$Sepal.Length ~ iris$Species, xlab = "Species", ylab = "Sepal length")
plotmeans(iris$Sepal.Width ~ iris$Species, xlab = "Species", ylab = "Sepal width")
Hypothesis:
- \(H_0\): The population means of the sepal length and the sepal width are not different across the species.
# Summarise the MANOVA result with different test methods
SepalSize <- cbind(iris$Sepal.Length, iris$Sepal.Width)
iris_manova <- manova(SepalSize ~ iris$Species)
summary(iris_manova, test = 'Pillai') # default
Df Pillai approx F num Df den Df Pr(>F)
iris$Species 2 0.94531 65.878 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(iris_manova, test = 'Wilks')
Df Wilks approx F num Df den Df Pr(>F)
iris$Species 2 0.16654 105.88 4 292 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(iris_manova, test = 'Roy')
Df Roy approx F num Df den Df Pr(>F)
iris$Species 2 4.1718 306.63 2 147 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(iris_manova, test = 'Hotelling-Lawley')
Df Hotelling-Lawley approx F num Df den Df Pr(>F)
iris$Species 2 4.3328 157.06 4 290 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Univariate ANOVAs for each dependent variable
summary.aov(iris_manova)
Response 1 :
Df Sum Sq Mean Sq F value Pr(>F)
iris$Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response 2 :
Df Sum Sq Mean Sq F value Pr(>F)
iris$Species 2 11.345 5.6725 49.16 < 2.2e-16 ***
Residuals 147 16.962 0.1154
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: The species has a statistically significant effect on the sepal width and sepal length.
8.4 Post-hoc test
Example: after One-way MANOVA gives a significant result, which group(s) is/are different from other(s)?
Method: Linear Discriminant Analysis (LDA)
# Visualization
library(MASS)
iris_lda <- lda(iris$Species ~ SepalSize, CV = FALSE)
plot_lda <- data.frame(Species = iris$Species, lda = predict(iris_lda)$x)
ggplot(plot_lda) + geom_point(aes(x = lda.LD1, y = lda.LD2, colour = Species))
Conclusion: The sepal size of the setosa species is different from other species.
8.5 Multivariate normality
8.5.1 Shapiro-Wilk test
Hypothesis:
\(H_0\): The variable follows a normal distribution
library(rstatix)
iris |>
  group_by(Species) |>
  shapiro_test(Sepal.Length, Sepal.Width)
# A tibble: 6 × 4
Species variable statistic p
<fct> <chr> <dbl> <dbl>
1 setosa Sepal.Length 0.978 0.460
2 setosa Sepal.Width 0.972 0.272
3 versicolor Sepal.Length 0.978 0.465
4 versicolor Sepal.Width 0.974 0.338
5 virginica Sepal.Length 0.971 0.258
6 virginica Sepal.Width 0.967 0.181
Tips:
- If the sample size is large (say n > 50), visual approaches such as the QQ-plot and the histogram are better for assessing the normality assumption (a histogram example follows, with a QQ-plot sketch after it).
iris[, c('Species', 'Sepal.Length', 'Sepal.Width')] |>
  pivot_longer(cols = c(Sepal.Length, Sepal.Width)) |>
ggplot() +
geom_histogram(aes(value)) +
facet_grid(name ~ Species)
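Since QQ-plots are recommended for larger samples, here is a minimal ggplot2 sketch for the same variables (stat_qq() plots sample quantiles against theoretical normal quantiles):
iris[, c('Species', 'Sepal.Length', 'Sepal.Width')] |>
  pivot_longer(cols = c(Sepal.Length, Sepal.Width)) |>
  ggplot(aes(sample = value)) +
  stat_qq() +        # points: sample vs theoretical quantiles
  stat_qq_line() +   # reference line for a normal distribution
  facet_grid(name ~ Species)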
Conclusion: As \(p>0.05\), the sepal length and the width for each species are normally distributed.
8.5.2 Mardia’s skewness and kurtosis test
Hypothesis:
\(H_0\): The variables follow a multivariate normal distribution
library(mvnormalTest)
mardia(iris[, c('Sepal.Length', 'Sepal.Width')])$mv.test
Test Statistic p-value Result
1 Skewness 9.4614 0.0505 YES
2 Kurtosis -0.691 0.4896 YES
3 MV Normality <NA> <NA> YES
Tip:
- When n > 20 for each combination of the independent and dependent variables, multivariate normality can be assumed (multivariate central limit theorem); a quick check follows.
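A quick check of this condition for the iris example:
table(iris$Species)  # each species has n = 50 > 20, so the assumption holds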
Conclusion: As \(p>0.05\), the sepal length and the width follow a multivariate normal distribution.
8.5.3 Homogeneity of the variance-covariance matrix
Main:
Box’s M test: Use a lower \(\alpha\) level such as \(\alpha = 0.001\) to assess the \(p\) value for significance.
Hypothesis:
\(H_0\): The variance-covariance matrices are equal for each combination formed by each group in the independent variable.
library(biotools)
boxM(cbind(iris$Sepal.Length, iris$Sepal.Width), iris$Species)
Box's M-test for Homogeneity of Covariance Matrices
data: cbind(iris$Sepal.Length, iris$Sepal.Width)
Chi-Sq (approx.) = 35.655, df = 6, p-value = 3.217e-06
Conclusion: As \(p < 0.001\), the variance-covariance matrices for the sepal length and width are not equal for each combination formed by each species.
8.5.4 Multivariate outliers
- MANOVA is highly sensitive to outliers and may produce Type I or II errors.
- Multivariate outliers can be detected using the Mahalanobis Distance test. The larger the Mahalanobis Distance, the more likely it is an outlier.
library(rstatix)
iris_outlier <- mahalanobis_distance(iris[, c('Sepal.Length', 'Sepal.Width')])
head(iris_outlier, 5)
Sepal.Length Sepal.Width mahal.dist is.outlier
1 5.1 3.5 1.646 FALSE
2 4.9 3.0 1.369 FALSE
3 4.7 3.2 1.934 FALSE
4 4.6 3.1 2.261 FALSE
5 5.0 3.6 2.321 FALSE
8.5.5 Linearity
- Or test the regression or the slope (ENV221)
# Visualize the pairwise scatterplot for the dependent variable for each group
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_smooth(method = 'lm') +
facet_wrap(Species ~ .)
8.5.6 Multicollinearity
Correlation between the dependent variables.
For three or more dependent variables, use a correlation matrix or the variance inflation factor (VIF); a sketch follows at the end of this subsection.
# Test the correlation
cor.test(x = iris$Sepal.Length, y = iris$Sepal.Width)
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.27269325 0.04351158
sample estimates:
cor
-0.1175698
# Visualization
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point() +
geom_smooth(method = 'lm')
- If \(|r|\) > 0.9, there is multicollinearity.
- If r is too low, perform separate univariate ANOVA for each dependent variable.
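For the three-or-more-variable case mentioned above, a minimal sketch of the correlation-matrix check on all four iris measurements, plus the VIF route (which assumes the car package is installed):
cor(iris[, 1:4])  # look for |r| > 0.9 in the pairwise correlation matrix
# VIF of each predictor in a model on the remaining variables (assumes package car)
car::vif(lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris))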
8.6 Two-way MANOVA
Example: Plastic. Do the rate of extrusion and the additive have influence on the plastic quality?
# Summary MANOVA result
data('Plastic', package = 'heplots')
Plastic_matrix <- as.matrix(Plastic[, c('tear', 'gloss', 'opacity')])
Plastic_manova <- manova(Plastic_matrix ~ Plastic$rate * Plastic$additive)
summary(Plastic_manova)
Df Pillai approx F num Df den Df Pr(>F)
Plastic$rate 1 0.61814 7.5543 3 14 0.003034 **
Plastic$additive 1 0.47697 4.2556 3 14 0.024745 *
Plastic$rate:Plastic$additive 1 0.22289 1.3385 3 14 0.301782
Residuals 16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Univariate ANOVAs for each dependent variable
summary.aov(Plastic_manova)
Response tear :
Df Sum Sq Mean Sq F value Pr(>F)
Plastic$rate 1 1.7405 1.74050 15.7868 0.001092 **
Plastic$additive 1 0.7605 0.76050 6.8980 0.018330 *
Plastic$rate:Plastic$additive 1 0.0005 0.00050 0.0045 0.947143
Residuals 16 1.7640 0.11025
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response gloss :
Df Sum Sq Mean Sq F value Pr(>F)
Plastic$rate 1 1.3005 1.30050 7.9178 0.01248 *
Plastic$additive 1 0.6125 0.61250 3.7291 0.07139 .
Plastic$rate:Plastic$additive 1 0.5445 0.54450 3.3151 0.08740 .
Residuals 16 2.6280 0.16425
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response opacity :
Df Sum Sq Mean Sq F value Pr(>F)
Plastic$rate 1 0.421 0.4205 0.1036 0.7517
Plastic$additive 1 4.901 4.9005 1.2077 0.2881
Plastic$rate:Plastic$additive 1 3.960 3.9605 0.9760 0.3379
Residuals 16 64.924 4.0578
9 ANCOVA
9.1 Definition of ANCOVA
Test whether the independent variable(s) has a significant influence on the dependent variable, excluding the influence of the covariate (preferably highly correlated with the dependent variable) \[Y_{ij} = (\mu+\tau_{i})+\beta(x_{ij}-\bar{x})+\epsilon_{ij}\]
- \(Y_{ij}\): the j-th observation under the i-th categorical group
- \(\mu\): the population mean
- \(i\): groups, 1,2, …
- \(j\): observations, 1,2,…
- \(\tau_i\): an adjustment to the y intercept for the i-th regression line
- \(\mu + \tau_i\): the intercept for group i
- \(\beta\): the slope of the line
- \(x_{ij}\): the j-th observation of the continuous variable under the i-th group
- \(\bar x\): the global mean of the variable x
- \(\epsilon _{ij}\): the associated unobserved error
Analysis of covariance (ANCOVA):
- Dependent variable (DV): One continuous variable
- Independent variables (IVs): One or multiple categorical variables, one or multiple continuous variables (covariate, CV)
Covariate (CV):
- An independent variable that is not manipulated by experimenters but still influences experimental results.
9.2 One-way ANCOVA
9.2.1 Question
Example 1: Does grazing have an influence on fruit production? Do grazed plants have higher fruit production than ungrazed ones?
- Independent variable:
- Grazing (categorical)
- Dependent variable:
- Fruit production (continuous)
df1 <- read.table("data/ipomopsis.txt", header = TRUE, stringsAsFactors = TRUE)
head(df1, 5)
Root Fruit Grazing
1 6.225 59.77 Ungrazed
2 6.487 60.98 Ungrazed
3 4.919 14.73 Ungrazed
4 5.130 19.28 Ungrazed
5 5.417 34.25 Ungrazed
tapply(df1$Fruit, df1$Grazing, mean)
Grazed Ungrazed
67.9405 50.8805
library(ggplot2)
ggplot(df1) + geom_boxplot(aes(Fruit, Grazing))
# Hypothesis test
t.test(Fruit ~ Grazing, data = df1, alternative = c("greater"))
Welch Two Sample t-test
data: Fruit by Grazing
t = 2.304, df = 37.306, p-value = 0.01344
alternative hypothesis: true difference in means between group Grazed and group Ungrazed is greater than 0
95 percent confidence interval:
4.570757 Inf
sample estimates:
mean in group Grazed mean in group Ungrazed
67.9405 50.8805
Example 2: What is the influence of grazing and root diameter on the fruit production of a plant?
Independent variables:
- grazing (categorical: grazed or ungrazed)
- root diameter (continuous, mm, covariate)
Dependent variable:
- fruit production (mg)
# Visualization
ggplot(df1, aes(Root, Fruit))+
geom_point() +
geom_smooth(method = 'lm') +
geom_point(aes(color = Grazing)) +
geom_smooth(aes(color = Grazing), method = 'lm')
9.2.2 Maximal model
| Symbol | Meaning |
|--------|---------|
| ~ | Separates the DV (left) from the IVs (right) |
| : | Interaction effect of two factors |
| * | Main effects of the two factors plus their interaction; f1 * f2 is equivalent to f1 + f2 + f1:f2 |
| ^ | Raises the sum of several terms to a power: the main effects of those terms plus the interactions between them |
| . | All variables except the DV |
# The maximal model
df1_ancova <- lm(Fruit ~ Grazing * Root, data = df1)
summary(df1_ancova)
Call:
lm(formula = Fruit ~ Grazing * Root, data = df1)
Residuals:
Min 1Q Median 3Q Max
-17.3177 -2.8320 0.1247 3.8511 17.1313
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -125.173 12.811 -9.771 1.15e-11 ***
GrazingUngrazed 30.806 16.842 1.829 0.0757 .
Root 23.240 1.531 15.182 < 2e-16 ***
GrazingUngrazed:Root 0.756 2.354 0.321 0.7500
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.831 on 36 degrees of freedom
Multiple R-squared: 0.9293, Adjusted R-squared: 0.9234
F-statistic: 157.6 on 3 and 36 DF, p-value: < 2.2e-16
# The ANOVA table for the maximal model
anova(df1_ancova)
Analysis of Variance Table
Response: Fruit
Df Sum Sq Mean Sq F value Pr(>F)
Grazing 1 2910.4 2910.4 62.3795 2.262e-09 ***
Root 1 19148.9 19148.9 410.4201 < 2.2e-16 ***
Grazing:Root 1 4.8 4.8 0.1031 0.75
Residuals 36 1679.6 46.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Other ways to see the ANOVA table
df1_aov <- aov(Fruit ~ Grazing * Root, data = df1)
summary(df1_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Grazing 1 2910 2910 62.380 2.26e-09 ***
Root 1 19149 19149 410.420 < 2e-16 ***
Grazing:Root 1 5 5 0.103 0.75
Residuals 36 1680 47
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(df1_ancova)
Df Sum Sq Mean Sq F value Pr(>F)
Grazing 1 2910 2910 62.380 2.26e-09 ***
Root 1 19149 19149 410.420 < 2e-16 ***
Grazing:Root 1 5 5 0.103 0.75
Residuals 36 1680 47
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9.2.3 Minimal model
# Delete the interaction term
df1_ancova2 <- update(df1_ancova, ~ . - Grazing:Root)
summary(df1_ancova2)
Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)
Residuals:
Min 1Q Median 3Q Max
-17.1920 -2.8224 0.3223 3.9144 17.3290
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -127.829 9.664 -13.23 1.35e-15 ***
GrazingUngrazed 36.103 3.357 10.75 6.11e-13 ***
Root 23.560 1.149 20.51 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared: 0.9291, Adjusted R-squared: 0.9252
F-statistic: 242.3 on 2 and 37 DF, p-value: < 2.2e-16
# Compare the simplified model with the maximal model
anova(df1_ancova, df1_ancova2)
Analysis of Variance Table
Model 1: Fruit ~ Grazing * Root
Model 2: Fruit ~ Grazing + Root
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 1679.7
2 37 1684.5 -1 -4.8122 0.1031 0.75
# Delete the grazing factor
df1_ancova3 <- update(df1_ancova2, ~ . - Grazing)
summary(df1_ancova3)
Call:
lm(formula = Fruit ~ Root, data = df1)
Residuals:
Min 1Q Median 3Q Max
-29.3844 -10.4447 -0.7574 10.7606 23.7556
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.286 10.723 -3.850 0.000439 ***
Root 14.022 1.463 9.584 1.1e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.52 on 38 degrees of freedom
Multiple R-squared: 0.7073, Adjusted R-squared: 0.6996
F-statistic: 91.84 on 1 and 38 DF, p-value: 1.099e-11
# Compare the two models
anova(df1_ancova2, df1_ancova3)
Analysis of Variance Table
Model 1: Fruit ~ Grazing + Root
Model 2: Fruit ~ Root
Res.Df RSS Df Sum of Sq F Pr(>F)
1 37 1684.5
2 38 6948.8 -1 -5264.4 115.63 6.107e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(df1_ancova2)
Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)
Residuals:
Min 1Q Median 3Q Max
-17.1920 -2.8224 0.3223 3.9144 17.3290
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -127.829 9.664 -13.23 1.35e-15 ***
GrazingUngrazed 36.103 3.357 10.75 6.11e-13 ***
Root 23.560 1.149 20.51 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared: 0.9291, Adjusted R-squared: 0.9252
F-statistic: 242.3 on 2 and 37 DF, p-value: < 2.2e-16
anova(df1_ancova2)
Analysis of Variance Table
Response: Fruit
Df Sum Sq Mean Sq F value Pr(>F)
Grazing 1 2910.4 2910.4 63.929 1.397e-09 ***
Root 1 19148.9 19148.9 420.616 < 2.2e-16 ***
Residuals 37 1684.5 45.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9.2.4 One step
Criterion: Akaike’s information criterion (AIC). A larger AIC indicates a worse model.
step(df1_ancova)
Start: AIC=157.5
Fruit ~ Grazing * Root
Df Sum of Sq RSS AIC
- Grazing:Root 1 4.8122 1684.5 155.61
<none> 1679.7 157.50
Step: AIC=155.61
Fruit ~ Grazing + Root
Df Sum of Sq RSS AIC
<none> 1684.5 155.61
- Grazing 1 5264.4 6948.8 210.30
- Root 1 19148.9 20833.4 254.22
Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)
Coefficients:
(Intercept) GrazingUngrazed Root
-127.83 36.10 23.56
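The same comparison can be made directly with AIC(); the absolute values differ from step()'s output (step() uses extractAIC(), which drops an additive constant), but the ranking of the models is the same:
AIC(df1_ancova, df1_ancova2)  # lower AIC indicates the preferred model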
9.2.5 Result
# Extract the fitted equation from the linear model
equatiomatic::extract_eq(df1_ancova2, use_coefs = TRUE)
\[ \operatorname{\widehat{Fruit}} = -127.83 + 36.1(\operatorname{Grazing}_{\operatorname{Ungrazed}}) + 23.56(\operatorname{Root}) \]
# Create a regression summary table in text format
stargazer::stargazer(df1_ancova2, type = 'text')
===============================================
Dependent variable:
---------------------------
Fruit
-----------------------------------------------
GrazingUngrazed 36.103***
(3.357)
Root 23.560***
(1.149)
Constant -127.829***
(9.664)
-----------------------------------------------
Observations 40
R2 0.929
Adjusted R2 0.925
Residual Std. Error 6.747 (df = 37)
F Statistic 242.272*** (df = 2; 37)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
The meaning of each parameter in this table
- Dependent variable: The name of the dependent variable. (因变量的名称)
- Independent variables: The names of the independent variables. (自变量的名称)
- Coefficients: Regression coefficients, indicating the degree to which an independent variable affects the dependent variable when it increases by one unit. (回归系数,表示自变量每增加一个单位对因变量的影响程度)
- Standard errors: Standard error of regression coefficients, measuring the stability of regression coefficients. (回归系数的标准误差,衡量回归系数的稳定性)
- t-statistics: t-value of regression coefficients, representing whether a regression coefficient is significantly different from zero and has a significant impact on the dependent variable. (回归系数的t值,代表回归系数是否显著不为0,即是否对因变量有显著影响)
- p-values: Significance level of regression coefficients, usually used to determine whether a regression coefficient is significantly different from zero. The smaller the p-value, the more significant the regression coefficient is considered to be. (回归系数的显著性水平,通常用于判断回归系数是否显著不为0,p值越小,表示回归系数越显著)
- Observations: Sample size. (样本数量)
- R2: Goodness of fit, the proportion of the variance in the dependent variable that the model explains; a higher value indicates a better fit between model and data. (拟合优度,表示模型解释因变量方差的比例,数值越高表示模型拟合程度越好)
- Adjusted R2: A modified version of R2 that takes the number of independent variables into account, for greater accuracy. (调整后的拟合优度,考虑到自变量的个数,比R2更准确)
- Residual standard error: The dispersion of the residuals; smaller values indicate a better model fit. (残差标准误,表示残差的离散程度,越小表示模型越好)
- F-statistic: A statistical test used to evaluate the overall goodness of fit of the linear model. (F统计量,用于检验模型整体拟合优度是否显著)
- df: Degrees of freedom. (自由度)
- Note: *, ** and *** mark regression coefficients whose p-values are below 0.1, 0.05 and 0.01 respectively, indicating their significance level. (p值分别小于0.1、0.05、0.01时用*、**、***表示,代表回归系数的显著性水平)
9.3 Two-way ANCOVA
9.3.1 Question
Previous experiments have shown that both the genotype and the sex of an organism affect weight gain. However, a scientist believes that, after adjusting for age, there is no significant difference in mean weight gain between groups at different levels of sex and genotype. Do the experimental data support this claim?
- Independent variables:
- genotype (categorical)
- sex (categorical)
- age (covariate)
- Dependent variable:
- Weight gain (continuous)
9.3.2 Model
Gain <- read.table("data/Gain.txt", header = T)
head(Gain, 3)
Weight Sex Age Genotype Score
1 7.445630 male 1 CloneA 4
2 8.000223 male 2 CloneA 4
3 7.705105 male 3 CloneA 4
m1 <- lm(Weight ~ Sex * Age * Genotype, data = Gain)
summary(m1)
Call:
lm(formula = Weight ~ Sex * Age * Genotype, data = Gain)
Residuals:
Min 1Q Median 3Q Max
-0.40218 -0.12043 -0.01065 0.12592 0.44687
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.80053 0.24941 31.276 < 2e-16 ***
Sexmale -0.51966 0.35272 -1.473 0.14936
Age 0.34950 0.07520 4.648 4.39e-05 ***
GenotypeCloneB 1.19870 0.35272 3.398 0.00167 **
GenotypeCloneC -0.41751 0.35272 -1.184 0.24429
GenotypeCloneD 0.95600 0.35272 2.710 0.01023 *
GenotypeCloneE -0.81604 0.35272 -2.314 0.02651 *
GenotypeCloneF 1.66851 0.35272 4.730 3.41e-05 ***
Sexmale:Age -0.11283 0.10635 -1.061 0.29579
Sexmale:GenotypeCloneB -0.31716 0.49882 -0.636 0.52891
Sexmale:GenotypeCloneC -1.06234 0.49882 -2.130 0.04010 *
Sexmale:GenotypeCloneD -0.73547 0.49882 -1.474 0.14906
Sexmale:GenotypeCloneE -0.28533 0.49882 -0.572 0.57087
Sexmale:GenotypeCloneF -0.19839 0.49882 -0.398 0.69319
Age:GenotypeCloneB -0.10146 0.10635 -0.954 0.34643
Age:GenotypeCloneC -0.20825 0.10635 -1.958 0.05799 .
Age:GenotypeCloneD -0.01757 0.10635 -0.165 0.86970
Age:GenotypeCloneE -0.03825 0.10635 -0.360 0.72123
Age:GenotypeCloneF -0.05512 0.10635 -0.518 0.60743
Sexmale:Age:GenotypeCloneB 0.15469 0.15040 1.029 0.31055
Sexmale:Age:GenotypeCloneC 0.35322 0.15040 2.349 0.02446 *
Sexmale:Age:GenotypeCloneD 0.19227 0.15040 1.278 0.20929
Sexmale:Age:GenotypeCloneE 0.13203 0.15040 0.878 0.38585
Sexmale:Age:GenotypeCloneF 0.08709 0.15040 0.579 0.56616
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2378 on 36 degrees of freedom
Multiple R-squared: 0.9742, Adjusted R-squared: 0.9577
F-statistic: 59.06 on 23 and 36 DF, p-value: < 2.2e-16
- The final minimal adequate model contains no interaction terms such as Age:Sex or Age:Genotype, so the slope of weight gain against age does not vary with sex or genotype.
- The final minimal adequate model includes three main effects (Sex, Age, Genotype), so there are intercept differences between the sexes and between the genotypes (intercepts vary).
- Because no interaction effect remains, the effects of these variables are independent of one another.
9.3.3 One step
m2 <- step(m1)
Start: AIC=-155.01
Weight ~ Sex * Age * Genotype
Df Sum of Sq RSS AIC
- Sex:Age:Genotype 5 0.34912 2.3849 -155.51
<none> 2.0358 -155.01
Step: AIC=-155.51
Weight ~ Sex + Age + Genotype + Sex:Age + Sex:Genotype + Age:Genotype
Df Sum of Sq RSS AIC
- Sex:Genotype 5 0.146901 2.5318 -161.92
- Age:Genotype 5 0.168136 2.5531 -161.42
- Sex:Age 1 0.048937 2.4339 -156.29
<none> 2.3849 -155.51
Step: AIC=-161.92
Weight ~ Sex + Age + Genotype + Sex:Age + Age:Genotype
Df Sum of Sq RSS AIC
- Age:Genotype 5 0.168136 2.7000 -168.07
- Sex:Age 1 0.048937 2.5808 -162.78
<none> 2.5318 -161.92
Step: AIC=-168.07
Weight ~ Sex + Age + Genotype + Sex:Age
Df Sum of Sq RSS AIC
- Sex:Age 1 0.049 2.749 -168.989
<none> 2.700 -168.066
- Genotype 5 54.958 57.658 5.612
Step: AIC=-168.99
Weight ~ Sex + Age + Genotype
Df Sum of Sq RSS AIC
<none> 2.749 -168.989
- Sex 1 10.374 13.122 -77.201
- Age 1 10.770 13.519 -75.415
- Genotype 5 54.958 57.707 3.662
summary(m2)
Call:
lm(formula = Weight ~ Sex + Age + Genotype, data = Gain)
Residuals:
Min 1Q Median 3Q Max
-0.40005 -0.15120 -0.01668 0.16953 0.49227
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.93701 0.10066 78.851 < 2e-16 ***
Sexmale -0.83161 0.05937 -14.008 < 2e-16 ***
Age 0.29958 0.02099 14.273 < 2e-16 ***
GenotypeCloneB 0.96778 0.10282 9.412 8.07e-13 ***
GenotypeCloneC -1.04361 0.10282 -10.149 6.21e-14 ***
GenotypeCloneD 0.82396 0.10282 8.013 1.21e-10 ***
GenotypeCloneE -0.87540 0.10282 -8.514 1.98e-11 ***
GenotypeCloneF 1.53460 0.10282 14.925 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2299 on 52 degrees of freedom
Multiple R-squared: 0.9651, Adjusted R-squared: 0.9604
F-statistic: 205.7 on 7 and 52 DF, p-value: < 2.2e-16
From the above output, we can see the coefficient of each genotype (the Estimate column) and its significance level. GenotypeCloneC and GenotypeCloneE have similar effects on the dependent variable, with coefficients close to -1 and very small p-values (< 0.001), so we can combine these two levels into one, reducing the number of genotype levels from six to five. The same approach can be applied to CloneB and CloneD.
# Check
newGenotype <- as.factor(Gain$Genotype)
levels(newGenotype)
[1] "CloneA" "CloneB" "CloneC" "CloneD" "CloneE" "CloneF"
# Change
levels(newGenotype)[c(3,5)] <- "ClonesCandE"
levels(newGenotype)[c(2,4)] <- "ClonesBandD"
levels(newGenotype)
[1] "CloneA" "ClonesBandD" "ClonesCandE" "CloneF"
# Linear regression & compare
m3 <- lm(Weight ~ Sex + Age + newGenotype, data = Gain)
anova(m2, m3)
Analysis of Variance Table
Model 1: Weight ~ Sex + Age + Genotype
Model 2: Weight ~ Sex + Age + newGenotype
Res.Df RSS Df Sum of Sq F Pr(>F)
1 52 2.7489
2 54 2.9938 -2 -0.24489 2.3163 0.1087
Analysis of RSS above
In regression models, the residual sum of squares (RSS) represents the variance in the dependent variable left unexplained. Here the RSS values for Model 1 and Model 2 are 2.7489 and 2.9938 respectively. A model aims to minimize the RSS, because it measures how far the fitted values lie from the observed values; in general, a smaller RSS indicates a better fit. To compare the fit of two models, both the RSS and the F-statistic can be used. Although Model 2 has a slightly larger RSS than Model 1, the p-value of the F-statistic is 0.1087, which does not reach the usual 0.05 significance level; therefore we cannot reject the null hypothesis that the two models fit equally well.
9.3.4 Result
As \(p=0.1087\), there is no significant difference between the two models. Therefore, the new model uses newGenotype (four levels) instead of Genotype (six levels).
summary(m3)
Call:
lm(formula = Weight ~ Sex + Age + newGenotype, data = Gain)
Residuals:
Min 1Q Median 3Q Max
-0.42651 -0.16687 0.01211 0.18776 0.47736
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.93701 0.10308 76.996 < 2e-16 ***
Sexmale -0.83161 0.06080 -13.679 < 2e-16 ***
Age 0.29958 0.02149 13.938 < 2e-16 ***
newGenotypeClonesBandD 0.89587 0.09119 9.824 1.28e-13 ***
newGenotypeClonesCandE -0.95950 0.09119 -10.522 1.10e-14 ***
newGenotypeCloneF 1.53460 0.10530 14.574 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2355 on 54 degrees of freedom
Multiple R-squared: 0.962, Adjusted R-squared: 0.9585
F-statistic: 273.7 on 5 and 54 DF, p-value: < 2.2e-16
# Extracting formulas from linear regression models
equatiomatic::extract_eq(m3, use_coefs = TRUE)
\[ \operatorname{\widehat{Weight}} = 7.94 - 0.83(\operatorname{Sex}_{\operatorname{male}}) + 0.3(\operatorname{Age}) + 0.9(\operatorname{newGenotype}_{\operatorname{ClonesBandD}}) - 0.96(\operatorname{newGenotype}_{\operatorname{ClonesCandE}}) + 1.53(\operatorname{newGenotype}_{\operatorname{CloneF}}) \]
# Create a text table of the regression diagnostics
stargazer::stargazer(m3, type = 'text')
==================================================
Dependent variable:
---------------------------
Weight
--------------------------------------------------
Sexmale -0.832***
(0.061)
Age 0.300***
(0.021)
newGenotypeClonesBandD 0.896***
(0.091)
newGenotypeClonesCandE -0.960***
(0.091)
newGenotypeCloneF 1.535***
(0.105)
Constant 7.937***
(0.103)
--------------------------------------------------
Observations 60
R2 0.962
Adjusted R2 0.959
Residual Std. Error 0.235 (df = 54)
F Statistic 273.652*** (df = 5; 54)
==================================================
Note: *p<0.1; **p<0.05; ***p<0.01
9.3.5 Visualization
plot(Weight ~ Age, data = Gain, type = "n")
colours <- c("green", "red", "black", "blue")
lines <- c(1, 2)
symbols <- c(16, 17)
NewSex <- as.factor(Gain$Sex)
points(Weight ~ Age, data = Gain, pch = symbols[as.numeric(NewSex)],
       col = colours[as.numeric(newGenotype)])
xv <- c(1, 5)
for (i in 1:2) {
  for (j in 1:4) {
    a <- coef(m3)[1] + (i > 1) * coef(m3)[2] + (j > 1) * coef(m3)[j + 2]
    b <- coef(m3)[3]
    yv <- a + b * xv
    lines(xv, yv, lty = lines[i], col = colours[j])
  }
}
10 MANCOVA
- Dependent variables: multiple continuous variables
- Independent variables: one or multiple categorical variables, one or multiple continuous variables (covariates)
10.1 Definition of MANCOVA
Multivariate Analysis of Covariance (MANCOVA) = multivariate ANCOVA = MANOVA with covariate(s)
Analysis of the differences among group means for a linear combination of the dependent variables after adjusting for the covariate. It tests whether the independent variable(s) have a significant influence on the dependent variables once the influence of the covariate (which should preferably be highly correlated with the dependent variables) is excluded. Assumptions:
- Independent Random Sampling: Independence of observations from all other observations.
- Level and Measurement of the Variables: The independent variables are categorical and the dependent variables are continuous or scale variables. Covariates are continuous.
- Homogeneity of Variance: Variance between groups is equal.
- Normality: For each group, each dependent variable follows a normal distribution, and any linear combination of the dependent variables is normally distributed.
Brief explanation
In MANCOVA we have several dependent variables (continuous or scale variables), one or more independent variables (categorical variables), and one or more covariates (continuous variables). The variables must be measured at the appropriate levels, and the observations must be sampled independently and at random, i.e., each observation must be independent of the others and not influenced by other variables. We also need to check that, within each group, every dependent variable follows a normal distribution and that the variances are equal. If the data satisfy these assumptions, MANCOVA can be used to explore the effect of the independent variables on the dependent variables while using the covariates to control for other influences. MANCOVA can establish more reliably whether group differences really exist than an analysis based on a single dependent variable. It is typically used in experimental designs where researchers want to compare the mean scores of several groups while controlling for other factors.
10.2 Workflow
10.2.1 Question
Example: Are there differences in productivity (measured by income and hours worked) for individuals in different age groups after adjusting for education level?
- Dependent variables:
- wage (continuous)
- age (continuous)
- Independent variables:
- education (categorical)
- year (continuous, covariate)
10.2.2 Visualization
library(tidyverse)
library(ISLR)
library(car)
ggplot(Wage, aes(age, wage)) + geom_point(alpha = 0.3) +
geom_smooth(method = lm) + facet_grid(year~education)
10.2.3 Model
The manova() way:
wage_manova1 <- manova(cbind(wage, age) ~ education * year, data = Wage)
wage_manova1
Call:
manova(cbind(wage, age) ~ education * year, data = Wage)
Terms:
education year education:year Residuals
wage 1226364 16807 2404 3976510
age 4608 494 724 393723
Deg. of Freedom 4 1 4 2990
Residual standard errors: 36.4683 11.47519
Estimated effects may be unbalanced
summary.aov(wage_manova1)
Response wage :
Df Sum Sq Mean Sq F value Pr(>F)
education 4 1226364 306591 230.5306 < 2.2e-16 ***
year 1 16807 16807 12.6374 0.000384 ***
education:year 4 2404 601 0.4519 0.771086
Residuals 2990 3976510 1330
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response age :
Df Sum Sq Mean Sq F value Pr(>F)
education 4 4608 1151.90 8.7477 5.108e-07 ***
year 1 494 494.13 3.7525 0.05282 .
education:year 4 724 180.90 1.3738 0.24043
Residuals 2990 393723 131.68
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Brief explanation
According to the results, the two factors education and year both have a significant effect on wage, since their p-values are below 0.05. However, the interaction term education:year has a p-value above 0.05, indicating that it has no significant effect on wage. For the age variable, education has a significant effect (p < 0.05) while year is borderline (p ≈ 0.05); the interaction education:year again has no significant effect on age, since its p-value is above 0.05.
The jmv::mancova() way:
library(jmv)
wage_manova2 <- jmv::mancova(data = Wage,
                             deps = vars(wage, age),
                             factors = education,
                             covs = year)
wage_manova2
MANCOVA
Multivariate Tests
─────────────────────────────────────────────────────────────────────────────────────────────
value F df1 df2 p
─────────────────────────────────────────────────────────────────────────────────────────────
education Pillai's Trace 0.240590546 102.353675 8 5988 < .0000001
Wilks' Lambda 0.7605539 109.739015 8 5986 < .0000001
Hotelling's Trace 0.313326457 117.184095 8 5984 < .0000001
Roy's Largest Root 0.308447997 230.873326 4 2994 < .0000001
year Pillai's Trace 0.004790217 7.203065 2 2993 0.0007573
Wilks' Lambda 0.9952098 7.203065 2 2993 0.0007573
Hotelling's Trace 0.004813274 7.203065 2 2993 0.0007573
Roy's Largest Root 0.004813274 7.203065 2 2993 0.0007573
─────────────────────────────────────────────────────────────────────────────────────────────
Univariate Tests
────────────────────────────────────────────────────────────────────────────────────────────────────────
Dependent Variable Sum of Squares df Mean Square F p
────────────────────────────────────────────────────────────────────────────────────────────────────────
education wage 1226364.4849 4 306591.1212 230.699570 < .0000001
age 4607.5852 4 1151.8963 8.743336 0.0000005
year wage 16806.9907 1 16806.9907 12.646699 0.0003821
age 494.1336 1 494.1336 3.750664 0.0528805
Residuals wage 3978914.2941 2994 1328.9627
age 394446.4359 2994 131.7456
────────────────────────────────────────────────────────────────────────────────────────────────────────
Brief explanation
In the Multivariate Tests table we see the multivariate hypothesis tests for the two independent variables education and year. All four statistics show that these variables jointly have a significant effect on the dependent variables wage and age (p < 0.0001). In the Univariate Tests table we see the univariate hypothesis tests for each dependent variable. education significantly affects both wage and age (p < 0.0001), whereas year significantly affects only wage (p = 0.0004); the effect of year on age is borderline, with p = 0.0529. The Residuals rows give the residual variance of each dependent variable in the model.
Overall, this MANCOVA model indicates that, after controlling for education and year, wage and age are related, and both education and year have a significant influence on this relationship.
10.2.4 Result
- Since the interaction effect is not significant (p = 0.77 for wage and p = 0.24 for age), the slopes are parallel.
- Wage and age differ significantly among education groups (p far below 0.05 for both).
- Wage also increases significantly over time (p = 0.00038 for year), presumably for economic reasons, while age shows no significant change over time (p = 0.053).
Brief explanation
- The interaction effect has p-values of 0.771086 for wage and 0.24043 for age, so it is not significant. This means the slopes of the wage and age differences across education groups are parallel.
- In the "Multivariate Tests" part of the table, the differences among education groups are significant for both wage and age (p-values far below 0.05).
- In the "Univariate Tests" part, differences in salary increase significantly over time (p = 0.000384), while differences in age show no significant change (p = 0.05282). This indicates that the passage of time, not a change in age, is an important driver of the change in salary.
11 Combining statistics
11.1 Combining means and standard errors
Example: Two separate but similar experiments measuring the rate of glucose production of liver cells.
Tip: when experiments are reported only as group summaries, we cannot simply average the means and standard errors; combining them must account for each group's sample size.
- Experiment 1: 4.802, 3.81, 4.004, 4.467, 3.8
- Experiment 2: 5.404, 5.256, 4.145, 5.401, 5.622, 4.312
- Calculate the overall \(n, \bar x, se\)
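When only the group summaries \(n_i, \bar x_i, se_i\) are available, the combined mean and standard error (with \(N = \sum_i n_i\)) are: \[\bar x = \frac{\sum_i n_i \bar x_i}{N}\] \[se = \sqrt{\frac{\sum_i n_i\left[(n_i - 1)\,se_i^2 + \bar x_i^2\right] - \left(\sum_i n_i \bar x_i\right)^2 / N}{N(N - 1)}}\] These are exactly what the cmean and cse lines further below compute.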
If we know the raw data:
# Raw data
x1 <- c(4.802, 3.81, 4.004, 4.467, 3.8)
x2 <- c(5.404, 5.256, 4.145, 5.401, 5.622, 4.312)
# Form a data frame of the group summaries
n1 <- length(x1)
n2 <- length(x2)
dtf <- data.frame(n = c(n1, n2), mean = c(mean(x1), mean(x2)), sd = c(sd(x1), sd(x2)))
dtf$se <- dtf$sd / sqrt(dtf$n)
dtf
n mean sd se
1 5 4.176600 0.4420043 0.1976703
2 6 5.023333 0.6288942 0.2567450
# Calculate the mean and standard error
x <- c(x1, x2)
x_bar <- mean(x)
n <- length(x)
se <- sd(x) / sqrt(n)
c(x_bar, se)
[1] 4.6384545 0.2070211
If we don't know the raw data:
cmean <- sum(dtf$mean * dtf$n) / sum(dtf$n)
cse <- sqrt((sum(dtf$n * ((dtf$n - 1) * dtf$se ^ 2 + dtf$mean ^ 2)) - sum(dtf$n * dtf$mean) ^ 2 / sum(dtf$n)) / (sum(dtf$n) * (sum(dtf$n) - 1)))
c(cmean, cse)
[1] 4.6384545 0.2070211
11.2 Mean and standard errors of sum and difference
11.2.1 Concepts
- First we have two variables \(p\) and \(q\)
- The third variable \(x = p + q\)
- The fourth variable \(y = p - q\)
- When \(n_p \neq n_q \neq n\): \[\bar x = \bar p + \bar q\] \[\bar y = \bar p - \bar q\] \[se_x = se_y = \sqrt{\frac{(n_p - 1)se_p^2 + (n_q - 1)se_q^2}{n_p + n_q - 2} \cdot \frac{n_p + n_q}{n_p n_q}}\]
- When \(n_p = n_q = n\): \[se_x = se_y = \sqrt{\frac{se_p^2 + se_q^2}{n}}\]
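A minimal numeric sketch of the equal-\(n\) case; the group size and the two standard errors here are made-up values, purely for illustration:
# Hypothetical groups p and q, each of size n, with assumed standard errors
n <- 10
se_p <- 0.2
se_q <- 0.3
# se of both the sum (x = p + q) and the difference (y = p - q)
se_xy <- sqrt((se_p^2 + se_q^2) / n)
se_xy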
11.2.2 Example
A luciferase-based assay is being used to quantify the amount of ATP and ADP in small tissue samples. The amount of ATP (q) is measured directly in 8 samples as \(3.25 \pm 0.14\ \mu mol\ g^{-1}\). A further 10 samples are treated with pyruvate kinase plus phosphoenolpyruvate to convert ADP quantitatively to ATP. The total ATP (p) in these samples is determined to be \(4.56 \pm 0.29\ \mu mol\ g^{-1}\). The ADP content is \(p - q\).
What is the mean and standard error of ADP concentration?
# Dataframe of example experiment
x <- data.frame(mean = c(3.25, 4.56), n = c(8, 10), se = c(0.14, 0.29))
x
mean n se
1 3.25 8 0.14
2 4.56 10 0.29
ADP <- diff(x$mean)
cse <- sqrt(sum(x$se ^ 2 * (x$n - 1)) / (sum(x$n) - 2) * sum(x$n) / prod(x$n))
c(ADP, cse)
[1] 1.3100000 0.1121306
11.3 Mean and standard error of ratios and products
11.3.1 Concepts
- First we have two variables \(p\) and \(q\)
- The third variable \(x = p \cdot q\)
- The fourth variable \(y = p / q\) \[\bar x = \bar p \cdot \bar q\] \[\bar y = \bar p / \bar q\] \[se_x = \sqrt{\frac{\bar p^2 n_q se_q^2 + \bar q^2 n_p se_p^2 + n_p se_p^2 n_q se_q^2}{n_p + n_q - 2}}\] \[se_y = \frac{1}{\bar q} \sqrt {\frac{n_p se_p^2 + n_qse_q^2(\frac{\bar p}{\bar q})^2}{n_p + n_q - 2}}\]
11.3.2 Example
In the previous example, we got the concentrations of ATP and ADP in a tissue sample. What is the ratio of [ATP]/[ADP]?
x <- data.frame(mean = c(3.25, ADP), n = c(8, 10), se = c(0.14, cse))
x
mean n se
1 3.25 8 0.1400000
2 1.31 10 0.1121306
ratiox <- x$mean[1] / x$mean[2]
cse <- sqrt((x$n[1] * x$se[1] ^ 2 + x$n[2] * x$se[2] ^ 2 * ((x$mean[1] / x$mean[2]) ^ 2)) / (sum(x$n) - 2)) / x$mean[2]
c(ratiox, cse)
[1] 2.4809160 0.1841063
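The product case \(x = p \cdot q\) is not computed above; a minimal sketch that reuses the summary data frame x from the ratio example and plugs it into the \(se_x\) formula (purely illustrative, since the product of the two concentrations has no direct physical meaning here):
# Mean and se of the product of the two means, using the se_x formula above
prodx <- x$mean[1] * x$mean[2]
cse_prod <- sqrt((x$mean[1] ^ 2 * x$n[2] * x$se[2] ^ 2 +
                  x$mean[2] ^ 2 * x$n[1] * x$se[1] ^ 2 +
                  x$n[1] * x$se[1] ^ 2 * x$n[2] * x$se[2] ^ 2) / (sum(x$n) - 2))
c(prodx, cse_prod)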
12 Non-parametric hypothesis tests
12.1 Definition
Non-parametric hypothesis tests:
A hypothesis test that does not require specific assumptions about the shape of the population distribution or the values of population parameters. When the population distribution is unknown, such a test is easier to perform than the corresponding parametric test, but it is less efficient (has less power) when the parametric assumptions do hold.
Wilcoxon Rank-Sum test: Used to compare whether the medians of two independent groups are equal. It is suitable for situations where the sample does not follow a normal distribution or where the variances are unequal.
Wilcoxon Signed-Rank test: Used to compare whether the medians of two related groups are equal. It is suitable for situations where the sample does not follow a normal distribution or where the variances are unequal.
Kruskal-Wallis test: Used to compare whether the medians of three or more independent groups are equal. It is suitable for situations where the sample does not follow a normal distribution or where the variances are unequal.
Friedman test: Used to compare whether the medians of three or more related groups are equal. It is suitable for situations where the sample does not follow a normal distribution or where the variances are unequal.
12.2 Wilcoxon Rank-Sum test
12.2.1 Concept
Wilcoxon Rank-Sum test, also known as the Mann-Whitney U test, is a non-parametric test for comparing the equality of the medians of two independent samples. Its original hypothesis is that the medians of the two samples are equal, and the alternative hypothesis is that the medians of the two samples are not equal.
- \(H_0\): the medians of the two populations are equal.
- Both samples are drawn randomly and independently.
- The measures within the two samples are able to be ranked and hence must be continuous, discrete or ordinal.
The basic idea of this test is to combine the two samples, rank them in order of size, and then calculate how many of the rankings in each sample are less than or equal to the value of that sample. The sum of these rankings is then used as the test statistic. If the medians of the two samples are equal, the test statistic should be close to half the sum of the total rankings. If the medians of the two samples are not equal, then the test statistic should be far from half the sum of the total rankings. \[U=\sum_{i=1}^n \sum_{j=1}^m S\left(X_i, Y_j\right)\] \[S(X, Y)= \begin{cases}1, & \text { if } X>Y \\ \frac{1}{2}, & \text { if } X=Y \\ 0, & \text { if } X<Y\end{cases}\]
- \(U\): The Wilcoxon Rank-Sum test statistic
- \(X_1, \ldots, X_n\): an independent sample from \(X\)
- \(Y_1, \ldots, Y_m\): an independent sample from \(Y\)
12.2.2 Example-1
Suppose we have two datasets x and y; calculate U (taken as the smaller of the two samples' statistics):
# Raw data
x <- c(3, 7)
y <- c(2, 4, 5, 8, 9)
m <- length(x)
n <- length(y)
# One step
# outer() applies an operation to every pair of elements from two vectors
U <- sum(outer(x, y, ">")) + sum(outer(x, y, "==")) * 0.5 + sum(outer(x, y, "<")) * 0
U
[1] 4
The larger m and n are, the closer the distribution of U is to a normal distribution, with mean and standard deviation:
\[\mu_U = mn/2\] \[s_U = \sqrt{\frac{mn(m + n + 1)}{12}}\]
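A quick numeric check of these two quantities for the tiny example above, reusing the m and n already defined:
# Normal-approximation parameters of U for m = 2, n = 5
U_mu <- m * n / 2
U_sd <- sqrt(m * n * (m + n + 1) / 12)
c(U_mu, U_sd)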
Ties concept:
# Duplicate value ranking
rank(c(1, 2, 2, 3))
[1] 1.0 2.5 2.5 4.0
12.2.3 Workflow
# One step
wilcox.test(x, y)
Wilcoxon rank sum exact test
data: x and y
W = 4, p-value = 0.8571
alternative hypothesis: true location shift is not equal to 0
dtf <- data.frame(value = c(x, y),
                  name = rep(c('x', 'y'), c(m, n)))
wilcox.test(value ~ name, data = dtf)
Wilcoxon rank sum exact test
data: value by name
W = 4, p-value = 0.8571
alternative hypothesis: true location shift is not equal to 0
# Boxplot visualization
boxplot(x, y, horizontal = TRUE)
U_critical <- qwilcox(c(0.025, 0.975), m, n)
U_critical
[1] 0 10
pwilcox(U, m, n) * 2
[1] 0.8571429
# Probability density function plot visualization
curve(dwilcox(x, m, n), from = -1, to = 15, n = 17, ylab = 'Probability distribution density', las = 1, type = 'S', ylim = c(0, 0.15))
abline(h = 0)
segments(x0 = c(U, U_critical), y0 = 0, x1 = c(U, U_critical), y1 = dwilcox(c(U, U_critical), m, n), col = c('blue', 'red', 'red'))
legend('topright', legend = c('Distribution curve', 'Critical values', 'U score'), col = c('black', 'red', 'blue'), lty = 1, cex = 0.7)
12.2.4 Example-2
The insect spray dataset for sprays C, D and F (here we pick just C and D): the number of insects that survived when treated with each insecticide.
# Raw data
insect <- read.csv("data/InsectSprays.csv")
insect <- insect[insect$spray %in% c("C", "D"), ]
# Visualization
boxplot(bugs ~ spray, data = insect, horizontal = TRUE)
Use the Shapiro-Wilk test on each treatment group to check whether it follows a normal distribution:
shapiro.test(insect$bugs[insect$spray == "C"])
Shapiro-Wilk normality test
data: insect$bugs[insect$spray == "C"]
W = 0.85907, p-value = 0.04759
shapiro.test(insect$bugs[insect$spray == "D"])
Shapiro-Wilk normality test
data: insect$bugs[insect$spray == "D"]
W = 0.75063, p-value = 0.002713
Instead, we could test whether the two medians are different:
tapply(insect$bugs, insect$spray, median)
C D
1.5 5.0
\(H_0\): The median of the survived insects treated by C is equal to that by D.
# Step by step
m <- sum(insect$spray == 'C')
n <- sum(insect$spray == 'D')
x <- insect$bugs[insect$spray == 'C']
y <- insect$bugs[insect$spray == 'D']
U1 <- sum(outer(x, y, ">")) + sum(outer(x, y, "==")) * 0.5
U2 <- sum(outer(y, x, ">")) + sum(outer(y, x, "==")) * 0.5
U <- min(c(U1, U2))
U_critical <- qwilcox(c(0.025, 0.975), m, n)
pwilcox(U, m, n) * 2
[1] 0.001829776
# Approximately z distribution
U_mu <- m * n / 2
U_sd <- sqrt(m * n * (m + n + 1) / 12)
z_score <- (U - U_mu) / U_sd
pnorm(z_score) * 2
[1] 0.002680172
# Median of all pairwise differences (C minus D), an estimate of the location shift
median(outer(insect$bugs[insect$spray == 'C'], insect$bugs[insect$spray == 'D'], "-"))
[1] -3
# One step
wilcox.test(bugs ~ spray, data = insect, conf.int=TRUE)
Wilcoxon rank sum test with continuity correction
data: bugs by spray
W = 20, p-value = 0.002651
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-4.000018 -1.000009
sample estimates:
difference in location
-2.999922
The meaning of each parameter in this table
- W: the Wilcoxon test statistic U.
- We are 95% certain that the median difference between spray D and C across the population will be between 1 and 4 bugs.
- It does not estimate the difference in medians but rather the median of the difference (between the two samples)
Decision: Reject \(H_0\)
Conclusion: There is a significant difference in insecticide effectiveness.
12.3 Wilcoxon Signed-Rank test
12.3.1 Concept
Test whether the median of the observed differences deviates enough from zero, based on paired samples. Roughly a non-parametric version of paired t-test.
\(H_0\): the medians of the two populations are equal.
12.3.2 Example
Ten people take part in a weight-loss program. They are weighed before starting the program and weighed again after the one-month program. Does the program have an effect on weight?
# Raw data
wl <- data.frame(
  id = LETTERS[1:10],
  before = c(198, 201, 210, 185, 204, 156, 167, 197, 220, 186),
  after = c(194, 203, 200, 183, 200, 153, 166, 197, 215, 184))
# Step by step
n <- nrow(wl)
d <- wl$after - wl$before
R <- rank(abs(d)) * abs(d) / d # signed ranks of the differences (NaN where d = 0)
W <- sum(R, na.rm = TRUE)
sW <- sqrt(n * (n + 1) * (2 * n + 1) / 6) # population standard deviation of W
z <- (abs(W) - 0.5) / sW # with continuity correction
z_critical <- qnorm(p = c(0.025, 0.975)) # alpha = 0.05
pnorm(z, lower.tail = FALSE) * 2
[1] 0.02040075
# One step
wilcox.test(wl$before, wl$after, paired = TRUE, conf.int = TRUE, correct = FALSE)
Wilcoxon signed rank test
data: wl$before and wl$after
V = 42, p-value = 0.02032
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
0.9999463 6.0000301
sample estimates:
(pseudo)median
3.00007
Decision: As \(p < 0.05\), reject \(H_0\).
Conclusion: The two population medians differ at \(\alpha = 0.05\); the program has an effect on weight.
12.4 Kruskal-Wallis test
12.4.1 Concept
Test whether three or more population medians are equal. Roughly a non-parametric version of ANOVA
\(H_0\): the medians of the populations are equal.
12.4.2 Example
The insect spray dataset for C, D, and F. The number of insects that survived when treated with insecticide treatments.
Is there any difference in the number of insects that survived among the available insecticide treatments?
# Raw data
insect <- read.csv("data/InsectSprays.csv")
# Visualization
boxplot(bugs ~ spray, data = insect, horizontal = TRUE, notch = TRUE)
shapiro.test(insect$bugs[insect$spray == "C"]) # Normal distribution
Shapiro-Wilk normality test
data: insect$bugs[insect$spray == "C"]
W = 0.85907, p-value = 0.04759
shapiro.test(insect$bugs[insect$spray == "D"]) # Normal distribution
Shapiro-Wilk normality test
data: insect$bugs[insect$spray == "D"]
W = 0.75063, p-value = 0.002713
shapiro.test(insect$bugs[insect$spray == "F"]) # Not normal distribution
Shapiro-Wilk normality test
data: insect$bugs[insect$spray == "F"]
W = 0.88475, p-value = 0.1009
# Determine if there is a significant difference between those three sprays
tapply(insect$bugs, insect$spray, median)
C D F
1.5 5.0 15.0
# One step
kruskal.test(bugs ~ spray, data = insect)
Kruskal-Wallis rank sum test
data: bugs by spray
Kruskal-Wallis chi-squared = 26.866, df = 2, p-value = 1.466e-06
Decision: As \(p < 0.05\), reject H0.
Conclusion: There is a significant difference in insecticide effectiveness.
12.5 Friedman’s test
12.5.1 Concept
Test whether three or more population medians are equal for repeated measures. Roughly a non-parametric version of repeated-measures ANOVA. More powerful than ANOVA for very skewed (heavy-tailed) distributions. \[F_{r}=\frac{12}{n k(k+1)}\left(T_{1}^{2}+T_{2}^{2}+\ldots+T_{k}^{2}\right)-3 n(k+1)\]
- \(n\): number of subjects (blocks)
- \(k\): number of treatments
- \(T_j\): sum of the ranks of treatment \(j\), where ranking is done within each subject
12.5.2 Example
# Raw data
dtf <- data.frame(
  before = c(198, 201, 210, 185, 204, 156, 167, 197, 220, 186),
  one = c(194, 203, 200, 183, 200, 153, 166, 197, 215, 184),
  two = c(191, 200, 192, 180, 195, 150, 167, 195, 209, 179),
  three = c(188, 196, 188, 178, 191, 145, 166, 192, 205, 175)
)
rownames(dtf) <- LETTERS[1:10]
k <- ncol(dtf)
n <- nrow(dtf)
Calculate the Friedman test statistic:
# Rank within each subject (row)
dtf_rank <- t(apply(dtf, MARGIN = 1, rank))
dtf_rank
before one two three
A 4.0 3.0 2.0 1.0
B 3.0 4.0 2.0 1.0
C 4.0 3.0 2.0 1.0
D 4.0 3.0 2.0 1.0
E 4.0 3.0 2.0 1.0
F 4.0 3.0 2.0 1.0
G 3.5 1.5 3.5 1.5
H 3.5 3.5 2.0 1.0
I 4.0 3.0 2.0 1.0
J 4.0 3.0 2.0 1.0
# Method-1 of chi-square value
rank_sum <- colSums(dtf_rank)
chi_sq <- 12 / (n * k * (k + 1)) * sum(rank_sum ^ 2) - 3 * n * (k + 1)
# Method-2 of chi-square value
rank_mean <- colMeans(dtf_rank)
rank_overall <- mean(1:k)
ssb <- n * sum((rank_mean - rank_overall) ^ 2)
chi_sq <- 12 * ssb / (k * (k + 1))
chi_sq
[1] 24.99
# Critical point
chi_critical <- qchisq(0.05, df = k - 1, lower.tail = FALSE)
chi_critical
[1] 7.814728
# P-value
p_value <- pchisq(chi_sq, df = k - 1, lower.tail = FALSE)
p_value
[1] 1.551501e-05
# One step
friedman.test(as.matrix(dtf))
Friedman rank sum test
data: as.matrix(dtf)
Friedman chi-squared = 25.763, df = 3, p-value = 1.069e-05
Tip: the sample is small, so the manually calculated statistic may differ from the result of friedman.test(), which applies a correction for tied ranks and uses more precise methods to compute the statistic.
13 Multiple linear regression
13.1 Definition
Linear Regression: A statistical method for quantifying the relationship between one or more predictor variables and one dependent variable.
- Simple Linear Regression: Only one predictor variable.
- Multiple linear regression: Two or more predictor variables.
Selection of predictor variables:
- The predictor variables must have a significant influence on the dependent variable and show a close linear correlation.
- The predictor variables should not be strongly correlated with one another; the correlation between predictors should be lower than the correlation between each predictor and the dependent variable.
- The predictor variables should have complete statistical data, and their predicted values can be easily determined.
Model: \[\begin{array}{l} y_1 = \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \ldots + \beta_k X_{1k} + \varepsilon_1 \\ y_2 = \beta_0 + \beta_1 X_{21} + \beta_2 X_{22} + \ldots + \beta_k X_{2k} + \varepsilon_2 \\ \quad \vdots \\ y_n = \beta_0 + \beta_1 X_{n1} + \beta_2 X_{n2} + \ldots + \beta_k X_{nk} + \varepsilon_n \end{array}\] \[\mathbf{Y} = (y_1, y_2, \ldots, y_n)^T, \quad \boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_k)^T, \quad \boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T\] \[\mathbf{X} = \left(\begin{array}{ccccc} 1 & X_{11} & X_{12} & \ldots & X_{1k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{array}\right)\] \[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
- \(X\): predictor variable
- \(y\): dependent variable
- \(\beta_k\): regression coefficient
- \(\epsilon\): effect of other random factors.
- The least-squares estimator of \(\beta\): \[\hat{\boldsymbol{\beta}} = \mathbf{(X'X)^{-1}X'Y}\] (see the numeric sketch below)
- The least-squares estimate of the random error variance: \[\hat\sigma^2 = \frac{e^T e}{n - k - 1}\]
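As a sanity check, \(\hat\beta = (X'X)^{-1}X'Y\) can be computed by hand and compared with lm(). A minimal sketch on the built-in mtcars data (an arbitrary illustrative dataset, not part of the course example):
# Manual least-squares estimate for mpg ~ wt + hp
X <- cbind(1, as.matrix(mtcars[, c("wt", "hp")])) # design matrix with an intercept column
Y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y # (X'X)^{-1} X'Y
cbind(manual = beta_hat, lm = coef(lm(mpg ~ wt + hp, data = mtcars)))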
Workflow:
- Fit the model by lm()
- Diagnose the model by summary()
- Optimize the model by step()
13.2 Visualization
# Import the US crime dataset
data(usc, package = "ACSWR")
# Draw the pair plot to visualize the correlations
## pairs(usc)
GGally::ggpairs(usc)
# The plot above shows a very high correlation between Ex0 and Ex1
cor(usc$Ex0, usc$Ex1)
[1] 0.9935865
# Show the correlation coefficients directly
ucor <- cor(usc)
ucor
R Age S Ed Ex0 Ex1
R 1.00000000 -0.08947240 -0.09063696 0.32283487 0.68760446 0.66671414
Age -0.08947240 1.00000000 0.58435534 -0.53023964 -0.50573690 -0.51317336
S -0.09063696 0.58435534 1.00000000 -0.70274132 -0.37263633 -0.37616753
Ed 0.32283487 -0.53023964 -0.70274132 1.00000000 0.48295213 0.49940958
Ex0 0.68760446 -0.50573690 -0.37263633 0.48295213 1.00000000 0.99358648
Ex1 0.66671414 -0.51317336 -0.37616753 0.49940958 0.99358648 1.00000000
LF 0.18886635 -0.16094882 -0.50546948 0.56117795 0.12149320 0.10634960
M 0.21391426 -0.02867993 -0.31473291 0.43691492 0.03376027 0.02284250
N 0.33747406 -0.28063762 -0.04991832 -0.01722740 0.52628358 0.51378940
NW 0.03259884 0.59319826 0.76710262 -0.66488190 -0.21370878 -0.21876821
U1 -0.05047792 -0.22438060 -0.17241931 0.01810345 -0.04369761 -0.05171199
U2 0.17732065 -0.24484339 0.07169289 -0.21568155 0.18509304 0.16922422
W 0.44131995 -0.67005506 -0.63694543 0.73599704 0.78722528 0.79426205
X -0.17902373 0.63921138 0.73718106 -0.76865789 -0.63050025 -0.64815183
LF M N NW U1 U2
R 0.1888663 0.21391426 0.33747406 0.03259884 -0.05047792 0.17732065
Age -0.1609488 -0.02867993 -0.28063762 0.59319826 -0.22438060 -0.24484339
S -0.5054695 -0.31473291 -0.04991832 0.76710262 -0.17241931 0.07169289
Ed 0.5611780 0.43691492 -0.01722740 -0.66488190 0.01810345 -0.21568155
Ex0 0.1214932 0.03376027 0.52628358 -0.21370878 -0.04369761 0.18509304
Ex1 0.1063496 0.02284250 0.51378940 -0.21876821 -0.05171199 0.16922422
LF 1.0000000 0.51355879 -0.12367222 -0.34121444 -0.22939968 -0.42076249
M 0.5135588 1.00000000 -0.41062750 -0.32730454 0.35189190 -0.01869169
N -0.1236722 -0.41062750 1.00000000 0.09515301 -0.03811995 0.27042159
NW -0.3412144 -0.32730454 0.09515301 1.00000000 -0.15645002 0.08090829
U1 -0.2293997 0.35189190 -0.03811995 -0.15645002 1.00000000 0.74592482
U2 -0.4207625 -0.01869169 0.27042159 0.08090829 0.74592482 1.00000000
W 0.2946323 0.17960864 0.30826271 -0.59010707 0.04485720 0.09207166
X -0.2698865 -0.16708869 -0.12629357 0.67731286 -0.06383218 0.01567818
W X
R 0.44131995 -0.17902373
Age -0.67005506 0.63921138
S -0.63694543 0.73718106
Ed 0.73599704 -0.76865789
Ex0 0.78722528 -0.63050025
Ex1 0.79426205 -0.64815183
LF 0.29463231 -0.26988646
M 0.17960864 -0.16708869
N 0.30826271 -0.12629357
NW -0.59010707 0.67731286
U1 0.04485720 -0.06383218
U2 0.09207166 0.01567818
W 1.00000000 -0.88399728
X -0.88399728 1.00000000
# Draw the correlation plot in two parts (numeric upper triangle, graphic lower triangle)
corrplot::corrplot(ucor, order = 'AOE', type = "upper", method = "number")
corrplot::corrplot(ucor, order = "AOE", type = "lower", method = "ell",
                   diag = FALSE, tl.pos = "n", cl.pos = "n", add = TRUE)
13.3 Fit the model
13.3.1 Analysis
# Do linear regression
crime_rate_lm <- lm(R ~ ., data = usc)
summary(crime_rate_lm)
Call:
lm(formula = R ~ ., data = usc)
Residuals:
Min 1Q Median 3Q Max
-34.884 -11.923 -1.135 13.495 50.560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.918e+02 1.559e+02 -4.438 9.56e-05 ***
Age 1.040e+00 4.227e-01 2.460 0.01931 *
S -8.308e+00 1.491e+01 -0.557 0.58117
Ed 1.802e+00 6.496e-01 2.773 0.00906 **
Ex0 1.608e+00 1.059e+00 1.519 0.13836
Ex1 -6.673e-01 1.149e+00 -0.581 0.56529
LF -4.103e-02 1.535e-01 -0.267 0.79087
M 1.648e-01 2.099e-01 0.785 0.43806
N -4.128e-02 1.295e-01 -0.319 0.75196
NW 7.175e-03 6.387e-02 0.112 0.91124
U1 -6.017e-01 4.372e-01 -1.376 0.17798
U2 1.792e+00 8.561e-01 2.093 0.04407 *
W 1.374e-01 1.058e-01 1.298 0.20332
X 7.929e-01 2.351e-01 3.373 0.00191 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.94 on 33 degrees of freedom
Multiple R-squared: 0.7692, Adjusted R-squared: 0.6783
F-statistic: 8.462 on 13 and 33 DF, p-value: 3.686e-07
# Compute confidence intervals for each parameter in this linear regression model
confint(crime_rate_lm)
2.5 % 97.5 %
(Intercept) -1.008994e+03 -374.6812339
Age 1.798032e-01 1.8998161
S -3.864617e+01 22.0295401
Ed 4.798774e-01 3.1233247
Ex0 -5.460558e-01 3.7616925
Ex1 -3.004455e+00 1.6699389
LF -3.532821e-01 0.2712200
M -2.623148e-01 0.5919047
N -3.047793e-01 0.2222255
NW -1.227641e-01 0.1371135
U1 -1.491073e+00 0.2877222
U2 5.049109e-02 3.5340347
W -7.795484e-02 0.3526718
X 3.146483e-01 1.2712173
# Compute analysis of variance tables for this linear regression model
anova(crime_rate_lm)
Analysis of Variance Table
Response: R
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 550.8 550.8 1.1448 0.2924072
S 1 153.7 153.7 0.3194 0.5757727
Ed 1 9056.7 9056.7 18.8221 0.0001275 ***
Ex0 1 30760.3 30760.3 63.9278 3.182e-09 ***
Ex1 1 1530.2 1530.2 3.1802 0.0837349 .
LF 1 611.3 611.3 1.2705 0.2677989
M 1 1110.0 1110.0 2.3069 0.1383280
N 1 426.5 426.5 0.8864 0.3533080
NW 1 142.0 142.0 0.2951 0.5906485
U1 1 70.7 70.7 0.1468 0.7040339
U2 1 2696.6 2696.6 5.6043 0.0239336 *
W 1 347.5 347.5 0.7221 0.4015652
X 1 5474.2 5474.2 11.3768 0.0019126 **
Residuals 33 15878.7 481.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
13.3.2 Conclusion
- The intercept term, Age, Ed, U2, and X are significant variables for explaining the crime rate. The 95% confidence intervals also confirm this.
- The model is significant: \(p < 0.05\), adjusted \(R^2 = 0.6783\).
13.3.3 The difference between \(R^2\) and Adj-\(R^2\)
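Adjusted \(R^2\) penalizes \(R^2\) for the number of predictors \(p\) relative to the sample size \(n\): \[R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}\] \(R^2\) never decreases when a predictor is added, while adjusted \(R^2\) levels off or drops once extra predictors stop improving the fit, as the plot below shows.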
get_R2 <- function(x, method){
  round(summary(lm(usc$R ~ as.matrix(usc[, 2:x])))[[method]], 3)
}
dtf_R2 <- data.frame(n = 1:13,
                     R2 = sapply(2:14, get_R2, method = 'r.squared'),
                     AdjR2 = sapply(2:14, get_R2, method = 'adj.r.squared'))
library(ggplot2)
library(tidyr)
dtf_R2 |>
  pivot_longer(-n) |>
  ggplot() +
  geom_point(aes(n, value, col = name)) +
  geom_line(aes(n, value, col = name))
13.4 Evaluate & improve the model
13.4.1 Issue
Multicollinearity:
- multi-: multiple columns
- col-: relationship
- linearity: linear
Problems:
- Imprecise estimates of \(\beta\)
- The t-tests may fail to reveal significant factors
- Missing importance of predictors
# Simple linear regression
data(Euphorbiaceae, package = 'gpk')
dtf <- data.frame(x1 = Euphorbiaceae$GBH,
                  y = Euphorbiaceae$Height)
plot(dtf)
lm(y ~ x1, data = dtf) |> summary()
Call:
lm(formula = y ~ x1, data = dtf)
Residuals:
Min 1Q Median 3Q Max
-29.369 -5.543 -1.566 5.707 28.086
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.76295 2.15471 4.067 9.29e-05 ***
x1 0.30303 0.03612 8.390 2.56e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.417 on 104 degrees of freedom
Multiple R-squared: 0.4037, Adjusted R-squared: 0.3979
F-statistic: 70.4 on 1 and 104 DF, p-value: 2.564e-13
# Multiple linear regression with a nearly collinear second predictor
dtf$x2 <- jitter(dtf$x1) # Add some random noise to a copy of x1
pairs(dtf)
lm(y ~ x1 + x2, data = dtf) |> summary()
Call:
lm(formula = y ~ x1 + x2, data = dtf)
Residuals:
Min 1Q Median 3Q Max
-29.417 -5.501 -1.563 5.707 28.061
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.7600 2.1659 4.045 0.000101 ***
x1 0.7077 7.9843 0.089 0.929547
x2 -0.4045 7.9822 -0.051 0.959679
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.463 on 103 degrees of freedom
Multiple R-squared: 0.4037, Adjusted R-squared: 0.3921
F-statistic: 34.86 on 2 and 103 DF, p-value: 2.736e-12
13.4.2 Variance inflation factor (VIF)
\[\mathrm{VIF}_j = \frac{1}{1-R_j^2}\]
- \(\mathrm{VIF}_j\): \(VIF\) for the \(j\)-th variable
- \(R^2_j\): \(R^2\) from the regression of the \(j\)-th explanatory variable on the remaining explanatory variables.
Remove the response R from the usc dataset:
uscrimewor <- usc[, -1]
names(uscrimewor)
[1] "Age" "S" "Ed" "Ex0" "Ex1" "LF" "M" "N" "NW" "U1" "U2" "W"
[13] "X"
Calculate the \(VIF\):
# Compute the VIF for Ex0
usc_lm_Ex0 <- summary(lm(Ex0 ~ ., data = uscrimewor))
1 / (1 - usc_lm_Ex0$r.squared)
[1] 94.63312
# Compute all of the VIFs
faraway::vif(uscrimewor)
Age S Ed Ex0 Ex1 LF M N
2.698021 4.876751 5.049442 94.633118 98.637233 3.677557 3.658444 2.324326
NW U1 U2 W X
4.123274 5.938264 4.997617 9.968958 8.409449
Filter: \(VIF > 10\) (\(R^2 > 0.9\))
Drop the variable with highest \(VIF\) and then update the model until all \(VIF \leq 10\):
uscrimewor2 <- uscrimewor[, c('Age','S','Ed','Ex0','LF','M','N','NW','U1','U2','W','X')]
faraway::vif(uscrimewor2)
Age S Ed Ex0 LF M N NW
2.670798 4.864533 4.822240 5.109739 3.414298 3.657629 2.321929 3.963407
U1 U2 W X
5.912063 4.982983 9.955587 8.354136
crime_rate_lm2 <- lm(R ~ Age + S + Ed + Ex0 + LF + M + N + NW + U1 + U2 + W + X, usc)
summary(crime_rate_lm2)
Call:
lm(formula = R ~ Age + S + Ed + Ex0 + LF + M + N + NW + U1 +
U2 + W + X, data = usc)
Residuals:
Min 1Q Median 3Q Max
-38.76 -13.59 1.09 13.25 48.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.041e+02 1.529e+02 -4.604 5.58e-05 ***
Age 1.064e+00 4.165e-01 2.556 0.015226 *
S -7.875e+00 1.475e+01 -0.534 0.596823
Ed 1.722e+00 6.286e-01 2.739 0.009752 **
Ex0 1.010e+00 2.436e-01 4.145 0.000213 ***
LF -1.718e-02 1.464e-01 -0.117 0.907297
M 1.630e-01 2.079e-01 0.784 0.438418
N -3.886e-02 1.282e-01 -0.303 0.763604
NW -1.299e-04 6.200e-02 -0.002 0.998340
U1 -5.848e-01 4.319e-01 -1.354 0.184674
U2 1.819e+00 8.465e-01 2.149 0.038833 *
W 1.351e-01 1.047e-01 1.290 0.205711
X 8.040e-01 2.320e-01 3.465 0.001453 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.72 on 34 degrees of freedom
Multiple R-squared: 0.7669, Adjusted R-squared: 0.6846
F-statistic: 9.32 on 12 and 34 DF, p-value: 1.351e-07
Conclusion: The 12 explanatory variables account for 76.7% of the variability in the crime rates of the 47 states.
13.4.3 Eigen System Analysis
uscrimewor <- as.matrix(uscrimewor)
usc_stan <- scale(uscrimewor)
x1x_stan <- t(usc_stan) %*% usc_stan
usc_eigen <- eigen(x1x_stan)
usc_kappa <- max(usc_eigen$values) / min(usc_eigen$values) # Condition number k
- \(k\) < 100: no multicollinearity
- 100 < \(k\) < 1000: moderate multicollinearity
- \(k\) > 1000: severe multicollinearity
usec_index <- max(usc_eigen$values) / usc_eigen$values # Condition indices
usc_eigen$vectors[, usec_index > 1000]
[1] -0.010618030 -0.005195201 0.032819351 0.698089398 -0.713641697
[6] -0.033483032 -0.003217082 -0.008582334 0.022697876 -0.004346733
[11] -0.013567474 -0.003024453 -0.015409343
From the eigenvectors alone we still do not know which variable should be removed, so the variance inflation factor (VIF) is the better tool.
13.5 Simplify the model
13.5.1 Backward selection
1. Start with all the predictors in the model.
2. Remove the predictor with the highest p-value greater than \(\alpha\) (usually 0.05).
3. Refit the model and go to step 2.
4. Stop when all p-values are less than \(\alpha\).
# Calculate the model coefficients and sort them from largest to smallest p-value
get_p <- function(model){
  smry2 <- data.frame(summary(model)$coefficients)
  smry2[order(-smry2$Pr...t..), ]
}
get_p(crime_rate_lm2)
Estimate Std..Error t.value Pr...t..
NW -1.299152e-04 0.06200366 -0.002095283 9.983405e-01
LF -1.717960e-02 0.14643342 -0.117320197 9.072966e-01
N -3.886138e-02 0.12818167 -0.303174262 7.636044e-01
S -7.874779e+00 14.74705947 -0.533989803 5.968230e-01
M 1.629746e-01 0.20785304 0.784085841 4.384185e-01
W 1.351072e-01 0.10472364 1.290130466 2.057112e-01
U1 -5.848089e-01 0.43191770 -1.353982319 1.846739e-01
U2 1.819172e+00 0.84648600 2.149086500 3.883343e-02
Age 1.064473e+00 0.41645193 2.556051830 1.522626e-02
Ed 1.721558e+00 0.62864970 2.738501354 9.752489e-03
X 8.040071e-01 0.23201637 3.465303351 1.452904e-03
Ex0 1.009730e+00 0.24359222 4.145163752 2.131854e-04
(Intercept) -7.040822e+02 152.94347802 -4.603545173 5.576618e-05
# Update-1
crime_rate_lm3 <- update(crime_rate_lm2, . ~ . - NW)
get_p(crime_rate_lm3)
Estimate Std..Error t.value Pr...t..
LF -0.01726849 0.1381358 -0.1250110 9.012301e-01
N -0.03886188 0.1263370 -0.3076048 7.602061e-01
S -7.88953914 12.7694182 -0.6178464 5.406763e-01
M 0.16308486 0.1981909 0.8228675 4.161547e-01
W 0.13515532 0.1007004 1.3421533 1.881884e-01
U1 -0.58495622 0.4200266 -1.3926648 1.725057e-01
U2 1.81920587 0.8341503 2.1809089 3.599816e-02
Age 1.06419549 0.3891918 2.7343726 9.740218e-03
Ed 1.72173268 0.6141343 2.8035115 8.189455e-03
X 0.80399165 0.2285624 3.5176026 1.227429e-03
Ex0 1.00951551 0.2179312 4.6322680 4.846288e-05
(Intercept) -704.12001486 149.6901234 -4.7038509 3.911594e-05
# Update-2
crime_rate_lm4 <- update(crime_rate_lm3, . ~ . - LF)
get_p(crime_rate_lm4)
Estimate Std..Error t.value Pr...t..
N -0.04145888 0.1229018 -0.3373334 7.378245e-01
S -7.18567759 11.3033138 -0.6357142 5.289835e-01
M 0.15058853 0.1687793 0.8922216 3.781994e-01
W 0.13416908 0.0990088 1.3551227 1.838211e-01
U1 -0.56348963 0.3780445 -1.4905379 1.447935e-01
U2 1.81276664 0.8210969 2.2077377 3.372029e-02
Age 1.06639529 0.3834415 2.7811163 8.565270e-03
Ed 1.70183790 0.5849903 2.9091728 6.175924e-03
X 0.79674056 0.2180364 3.6541625 8.159010e-04
Ex0 1.01443740 0.2113944 4.7987905 2.774137e-05
(Intercept) -700.19856796 144.3514585 -4.8506511 2.369307e-05
# Update-3
crime_rate_lm5 <- update(crime_rate_lm4, . ~ . - N)
get_p(crime_rate_lm5)
Estimate Std..Error t.value Pr...t..
S -6.5057045 10.98812498 -0.5920668 5.574067e-01
M 0.1772842 0.14728022 1.2037201 2.363427e-01
W 0.1282847 0.09628581 1.3323326 1.908991e-01
U1 -0.5752009 0.37191143 -1.5466073 1.304695e-01
U2 1.8090238 0.81113002 2.2302513 3.187983e-02
Age 1.0685350 0.37876979 2.8210671 7.651917e-03
Ed 1.7019848 0.57794198 2.9449059 5.556896e-03
X 0.7710981 0.20189436 3.8193147 4.943947e-04
Ex0 0.9833297 0.18792831 5.2324725 6.864110e-06
(Intercept) -716.5469741 134.33461432 -5.3340457 5.005846e-06
# Update-4
crime_rate_lm6 <- update(crime_rate_lm5, . ~ . - S)
get_p(crime_rate_lm6)
Estimate Std..Error t.value Pr...t..
M 0.1848948 0.14545901 1.271113 2.114146e-01
W 0.1271449 0.09544037 1.332192 1.907319e-01
U1 -0.5295794 0.36071897 -1.468122 1.503006e-01
U2 1.7070199 0.78581970 2.172279 3.614090e-02
Age 1.0083194 0.36172869 2.787502 8.248097e-03
Ed 1.7416360 0.56912198 3.060216 4.044243e-03
X 0.7320980 0.18920844 3.869267 4.153598e-04
Ex0 0.9785376 0.18614257 5.256926 5.935205e-06
(Intercept) -714.4047282 133.13339398 -5.366082 4.210448e-06
# Update-5
crime_rate_lm7 <- update(crime_rate_lm6, . ~ . - M)
get_p(crime_rate_lm7)
Estimate Std..Error t.value Pr...t..
U1 -0.3091858 0.3188025 -0.969835 3.381053e-01
W 0.1454765 0.0950863 1.529942 1.341027e-01
U2 1.4415600 0.7635174 1.888051 6.647498e-02
Age 1.1317545 0.3511903 3.222624 2.566455e-03
Ed 2.0273937 0.5269502 3.847410 4.309248e-04
X 0.7960871 0.1838228 4.330732 1.006325e-04
Ex0 0.9862878 0.1875055 5.260046 5.496412e-06
(Intercept) -614.6654324 108.3983291 -5.670433 1.485771e-06
# Update-6
crime_rate_lm8 <- update(crime_rate_lm7, . ~ . - U1)
get_p(crime_rate_lm8)
Estimate Std..Error t.value Pr...t..
W 0.1595649 0.0939003 1.699302 9.702803e-02
U2 0.8281694 0.4273992 1.937695 5.974283e-02
Age 1.1251803 0.3508640 3.206884 2.640080e-03
Ed 1.8178631 0.4802674 3.785106 5.048962e-04
X 0.8235714 0.1814902 4.537828 5.095878e-05
(Intercept) -618.5028439 108.2456006 -5.713884 1.193042e-06
Ex0 1.0506875 0.1752237 5.996264 4.783972e-07
# Update-7
crime_rate_lm9 <- update(crime_rate_lm8, . ~ . - W)
get_p(crime_rate_lm9)
Estimate Std..Error t.value Pr...t..
U2 0.9136081 0.4340919 2.104642 4.149584e-02
Age 1.0198219 0.3532027 2.887356 6.175068e-03
Ed 2.0307726 0.4741890 4.282623 1.086147e-04
X 0.6349257 0.1468460 4.323752 9.559356e-05
(Intercept) -524.3743270 95.1155692 -5.513023 2.126586e-06
Ex0 1.2331222 0.1416347 8.706359 7.259666e-11
# Final summary
summary(crime_rate_lm9)
Call:
lm(formula = R ~ Age + Ed + Ex0 + U2 + X, data = usc)
Residuals:
Min 1Q Median 3Q Max
-45.344 -9.859 -1.807 10.603 62.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -524.3743 95.1156 -5.513 2.13e-06 ***
Age 1.0198 0.3532 2.887 0.006175 **
Ed 2.0308 0.4742 4.283 0.000109 ***
Ex0 1.2331 0.1416 8.706 7.26e-11 ***
U2 0.9136 0.4341 2.105 0.041496 *
X 0.6349 0.1468 4.324 9.56e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.3 on 41 degrees of freedom
Multiple R-squared: 0.7296, Adjusted R-squared: 0.6967
F-statistic: 22.13 on 5 and 41 DF, p-value: 1.105e-10
13.5.2 Forward selection
- Start with no variables in the model.
- For each predictor not yet in the model, check its p-value when it is added. Add the predictor with the lowest p-value, provided it is below the critical \(\alpha\).
- Continue until no new predictor can be added (a sketch using step() follows below).
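No worked example is given above for forward selection, so here is a minimal sketch on the same usc data, assuming the full model crime_rate_lm fitted earlier. Note that step() adds variables by AIC rather than by p-value, so this is the AIC-based variant of the procedure:
# Forward selection: start from the intercept-only model and grow towards the full model
null_model <- lm(R ~ 1, data = usc)
crime_rate_fwd <- step(null_model,
                       scope = formula(crime_rate_lm), # upper bound of the search
                       direction = "forward")
summary(crime_rate_fwd)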
13.5.3 AIC selection
Akaike Information Criterion (AIC) principle: the larger the AIC, the worse the model.
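For a linear model, the AIC value that step() prints is, up to an additive constant, \[\mathrm{AIC} = n \ln\left(\frac{RSS}{n}\right) + 2k\] where \(n\) is the sample size, \(RSS\) the residual sum of squares, and \(k\) the number of estimated parameters; dropping a useless predictor therefore lowers the AIC.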
crime_rate_aic <- step(crime_rate_lm, direction = "both")
Start: AIC=301.66
R ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW + U1 + U2 + W +
X
Df Sum of Sq RSS AIC
- NW 1 6.1 15885 299.68
- LF 1 34.4 15913 299.76
- N 1 48.9 15928 299.81
- S 1 149.4 16028 300.10
- Ex1 1 162.3 16041 300.14
- M 1 296.5 16175 300.53
<none> 15879 301.66
- W 1 810.6 16689 302.00
- U1 1 911.5 16790 302.29
- Ex0 1 1109.8 16989 302.84
- U2 1 2108.8 17988 305.52
- Age 1 2911.6 18790 307.57
- Ed 1 3700.5 19579 309.51
- X 1 5474.2 21353 313.58
Step: AIC=299.68
R ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + U1 + U2 + W + X
Df Sum of Sq RSS AIC
- LF 1 28.7 15913 297.76
- N 1 48.6 15933 297.82
- Ex1 1 156.3 16041 298.14
- S 1 158.0 16043 298.14
- M 1 294.1 16179 298.54
<none> 15885 299.68
- W 1 820.2 16705 300.05
- U1 1 913.1 16798 300.31
- Ex0 1 1104.3 16989 300.84
+ NW 1 6.1 15879 301.66
- U2 1 2107.1 17992 303.53
- Age 1 3365.8 19251 306.71
- Ed 1 3757.1 19642 307.66
- X 1 5503.6 21388 311.66
Step: AIC=297.76
R ~ Age + S + Ed + Ex0 + Ex1 + M + N + U1 + U2 + W + X
Df Sum of Sq RSS AIC
- N 1 62.2 15976 295.95
- S 1 129.4 16043 296.14
- Ex1 1 134.8 16048 296.16
- M 1 276.8 16190 296.57
<none> 15913 297.76
- W 1 801.9 16715 298.07
- U1 1 941.8 16855 298.47
- Ex0 1 1075.9 16989 298.84
+ LF 1 28.7 15885 299.68
+ NW 1 0.3 15913 299.76
- U2 1 2088.5 18002 301.56
- Age 1 3407.9 19321 304.88
- Ed 1 3895.3 19809 306.06
- X 1 5621.3 21535 309.98
Step: AIC=295.95
R ~ Age + S + Ed + Ex0 + Ex1 + M + U1 + U2 + W + X
Df Sum of Sq RSS AIC
- S 1 104.4 16080 294.25
- Ex1 1 123.3 16099 294.31
- M 1 533.8 16509 295.49
<none> 15976 295.95
- W 1 748.7 16724 296.10
- U1 1 997.7 16973 296.80
- Ex0 1 1021.3 16997 296.86
+ N 1 62.2 15913 297.76
+ LF 1 42.3 15933 297.82
+ NW 1 0.0 15976 297.95
- U2 1 2082.3 18058 299.71
- Age 1 3425.9 19402 303.08
- Ed 1 3887.6 19863 304.19
- X 1 5896.9 21873 308.71
Step: AIC=294.25
R ~ Age + Ed + Ex0 + Ex1 + M + U1 + U2 + W + X
Df Sum of Sq RSS AIC
- Ex1 1 171.5 16252 292.75
- M 1 563.4 16643 293.87
<none> 16080 294.25
- W 1 734.7 16815 294.35
- U1 1 906.0 16986 294.83
- Ex0 1 1162.0 17242 295.53
+ S 1 104.4 15976 295.95
+ N 1 37.2 16043 296.14
+ NW 1 14.2 16066 296.21
+ LF 1 1.9 16078 296.25
- U2 1 1978.0 18058 297.71
- Age 1 3354.5 19434 301.16
- Ed 1 4139.1 20219 303.02
- X 1 6094.8 22175 307.36
Step: AIC=292.75
R ~ Age + Ed + Ex0 + M + U1 + U2 + W + X
Df Sum of Sq RSS AIC
- M 1 691.0 16943 292.71
<none> 16252 292.75
- W 1 759.0 17011 292.90
- U1 1 921.8 17173 293.35
+ Ex1 1 171.5 16080 294.25
+ S 1 152.5 16099 294.31
+ NW 1 36.0 16215 294.65
+ N 1 23.1 16228 294.69
+ LF 1 5.5 16246 294.74
- U2 1 2018.1 18270 296.25
- Age 1 3323.1 19575 299.50
- Ed 1 4005.1 20257 301.11
- X 1 6402.7 22654 306.36
- Ex0 1 11818.8 28070 316.44
Step: AIC=292.71
R ~ Age + Ed + Ex0 + U1 + U2 + W + X
Df Sum of Sq RSS AIC
- U1 1 408.6 17351 291.83
<none> 16943 292.71
+ M 1 691.0 16252 292.75
- W 1 1016.9 17959 293.45
+ Ex1 1 299.0 16643 293.87
+ N 1 262.5 16680 293.98
+ LF 1 213.3 16729 294.11
+ S 1 213.1 16729 294.11
+ NW 1 114.2 16828 294.39
- U2 1 1548.6 18491 294.82
- Age 1 4511.6 21454 301.81
- Ed 1 6430.6 23373 305.83
- X 1 8147.7 25090 309.16
- Ex0 1 12019.6 28962 315.91
Step: AIC=291.83
R ~ Age + Ed + Ex0 + U2 + W + X
Df Sum of Sq RSS AIC
<none> 17351 291.83
+ U1 1 408.6 16942 292.71
- W 1 1252.6 18604 293.11
+ Ex1 1 251.2 17100 293.14
+ LF 1 230.7 17120 293.20
+ N 1 189.6 17162 293.31
+ M 1 177.8 17173 293.35
+ S 1 71.0 17280 293.64
+ NW 1 59.2 17292 293.67
- U2 1 1628.7 18980 294.05
- Age 1 4461.0 21812 300.58
- Ed 1 6214.7 23566 304.22
- X 1 8932.3 26283 309.35
- Ex0 1 15596.5 32948 319.97
summary(crime_rate_aic)
Call:
lm(formula = R ~ Age + Ed + Ex0 + U2 + W + X, data = usc)
Residuals:
Min 1Q Median 3Q Max
-38.306 -10.209 -1.313 9.919 54.544
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -618.5028 108.2456 -5.714 1.19e-06 ***
Age 1.1252 0.3509 3.207 0.002640 **
Ed 1.8179 0.4803 3.785 0.000505 ***
Ex0 1.0507 0.1752 5.996 4.78e-07 ***
U2 0.8282 0.4274 1.938 0.059743 .
W 0.1596 0.0939 1.699 0.097028 .
X 0.8236 0.1815 4.538 5.10e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.83 on 40 degrees of freedom
Multiple R-squared: 0.7478, Adjusted R-squared: 0.71
F-statistic: 19.77 on 6 and 40 DF, p-value: 1.441e-10
13.5.4 Result
According to the backward selection:
equatiomatic::extract_eq(crime_rate_lm9, use_coefs = TRUE)
\[ \operatorname{\widehat{R}} = -524.37 + 1.02(\operatorname{Age}) + 2.03(\operatorname{Ed}) + 1.23(\operatorname{Ex0}) + 0.91(\operatorname{U2}) + 0.63(\operatorname{X}) \]
stargazer::stargazer(crime_rate_lm9, type = 'text')
===============================================
Dependent variable:
---------------------------
R
-----------------------------------------------
Age 1.020***
(0.353)
Ed 2.031***
(0.474)
Ex0 1.233***
(0.142)
U2 0.914**
(0.434)
X 0.635***
(0.147)
Constant -524.374***
(95.116)
-----------------------------------------------
Observations 47
R2 0.730
Adjusted R2 0.697
Residual Std. Error 21.301 (df = 41)
F Statistic 22.129*** (df = 5; 41)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
The model shows that the variables Age, Ed, Ex0, U2, and X explain 73% of the variability in the crime rates.
According to the AIC selection:
equatiomatic::extract_eq(crime_rate_aic, use_coefs = TRUE)
\[ \operatorname{\widehat{R}} = -618.5 + 1.13(\operatorname{Age}) + 1.82(\operatorname{Ed}) + 1.05(\operatorname{Ex0}) + 0.83(\operatorname{U2}) + 0.16(\operatorname{W}) + 0.82(\operatorname{X}) \]
stargazer::stargazer(crime_rate_aic, type = 'text')
===============================================
Dependent variable:
---------------------------
R
-----------------------------------------------
Age 1.125***
(0.351)
Ed 1.818***
(0.480)
Ex0 1.051***
(0.175)
U2 0.828*
(0.427)
W 0.160*
(0.094)
X 0.824***
(0.181)
Constant -618.503***
(108.246)
-----------------------------------------------
Observations 47
R2 0.748
Adjusted R2 0.710
Residual Std. Error 20.827 (df = 40)
F Statistic 19.771*** (df = 6; 40)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
The model shows that the variables Age, Ed, Ex0, U2, W, and X explain 75% of the variability in the crime rates.
14 Logistic regression
14.1 Limitation of linear model
Some limitations of linear models:
- Predictions can fall outside the range of possible y values (e.g., below 0 or above 1 for a binary outcome).
- The residuals-vs-fitted plot shows two parallel lines, which is very unlikely if a linear model is appropriate.
# Scatter plot + abline
library(ggplot2)
iris2 <- iris[iris$Species %in% c('versicolor', 'setosa'),
              c('Species', 'Sepal.Length', 'Petal.Length')]
iris2$SpeciesLevel <- as.integer(iris2$Species) - 1 # 0: setosa, 1: versicolor
ggplot(iris2) +
  geom_point(aes(Sepal.Length, SpeciesLevel), alpha = 0.2, size = 6) +
  geom_smooth(aes(Sepal.Length, SpeciesLevel), method = 'lm')
# Check the statistical analysis
lm_iris2 <- lm(SpeciesLevel ~ Sepal.Length, data = iris2)
summary(lm_iris2)
Call:
lm(formula = SpeciesLevel ~ Sepal.Length, data = iris2)
Residuals:
Min 1Q Median 3Q Max
-0.68764 -0.23802 -0.04506 0.25533 0.82566
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.62027 0.29859 -8.776 5.47e-14 ***
Sepal.Length 0.57033 0.05421 10.521 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3461 on 98 degrees of freedom
Multiple R-squared: 0.5304, Adjusted R-squared: 0.5256
F-statistic: 110.7 on 1 and 98 DF, p-value: < 2.2e-16
# Multiple residual diagnostic plots
par(mfrow = c(2, 2))
plot(lm_iris2)
So we can use the following code instead of the previous one:
iris2$SepalBin <- round(iris2$Sepal.Length * 4) / 4
iris_tbl <- table(iris2$SepalBin, iris2$SpeciesLevel)
iris_p <- data.frame(cbind(iris_tbl))
iris_p$SepalBin <- as.numeric(row.names(iris_p))
iris_p$p <- iris_p$X1 / (iris_p$X0 + iris_p$X1) # Proportion of versicolor
ggplot(iris_p) + geom_point(aes(SepalBin, p))
Then we can use the sigmoid curve \(\frac{1}{1 + e^{-x}}\) to describe this relationship:
curve(1 / (1 + exp(-x)), xlim = c(-10, 10))
abline(h = 0:1, col = "royalblue4",lty = 2)
14.2 Binary logistic regression
14.2.1 Concept
(Binary) Logistic regression:
Use a logistic function to model a binary dependent variable (1/0, yes/no, true/false, win/lose, male/female) against one or more explanatory variables (a numeric sketch of the two forms follows the definitions below). \[\log \frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p\] Can be converted into: \[p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}}\]
- \(X_i\): The \(i\)-th predictor variable
- \(\beta_i\): The coefficient estimate for the \(i\)-th predictor variable
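A minimal sketch of the conversion between the two forms, using hypothetical coefficients b0 and b1 (chosen to be close to the fit in the example below):
# Convert log-odds to a probability (hypothetical coefficients)
b0 <- -27.8; b1 <- 5.14
x <- 5.5 # a sepal length in cm
log_odds <- b0 + b1 * x
p <- exp(log_odds) / (1 + exp(log_odds)) # equivalently plogis(log_odds)
p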
14.2.2 Example
14.2.2.1 Fit the model
Iris data:
# Check the iris2 structure
head(iris2, 3)
Species Sepal.Length Petal.Length SpeciesLevel SepalBin
1 setosa 5.1 1.4 0 5.00
2 setosa 4.9 1.4 0 5.00
3 setosa 4.7 1.3 0 4.75
# Fit a logistic regression of SpeciesLevel against Sepal.Length on the iris2 dataset
lg_iris1 <- glm(SpeciesLevel ~ Sepal.Length, data = iris2, family = binomial)
summary(lg_iris1)
Call:
glm(formula = SpeciesLevel ~ Sepal.Length, family = binomial,
data = iris2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.05501 -0.47395 -0.02829 0.39788 2.32915
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -27.831 5.434 -5.122 3.02e-07 ***
Sepal.Length 5.140 1.007 5.107 3.28e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.629 on 99 degrees of freedom
Residual deviance: 64.211 on 98 degrees of freedom
AIC: 68.211
Number of Fisher Scoring iterations: 6
# McFadden's pseudo-R squared evaluates the goodness of fit of logistic regression models
# The greater the value, the better the fit; values of 0.2-0.4 indicate an excellent fit
pscl::pR2(lg_iris1)["McFadden"]
fitting null model for pseudo-r2
McFadden
0.5368136
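The same value can be computed by hand from the log-likelihoods; a quick sketch, introducing a hypothetical null model null_iris:
# McFadden pseudo-R2 = 1 - logLik(full model) / logLik(null model)
null_iris <- glm(SpeciesLevel ~ 1, data = iris2, family = binomial)
1 - as.numeric(logLik(lg_iris1)) / as.numeric(logLik(null_iris))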
# Extract the formula from lg_iris1
equatiomatic::extract_eq(lg_iris1, use_coefs = TRUE)
\[ \log\left[ \frac { \widehat{P( \operatorname{SpeciesLevel} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{SpeciesLevel} = \operatorname{1} )} } \right] = -27.83 + 5.14(\operatorname{Sepal.Length}) \]
# Show the statistical result
stargazer::stargazer(lg_iris1, type = 'text')
=============================================
Dependent variable:
---------------------------
SpeciesLevel
---------------------------------------------
Sepal.Length 5.140***
(1.007)
Constant -27.831***
(5.434)
---------------------------------------------
Observations 100
Log Likelihood -32.106
Akaike Inf. Crit. 68.211
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
14.2.2.2 Assess the model
# Fit a logistic regression of SpeciesLevel against Sepal.Length + Petal.Length
lg_iris2 <- glm(SpeciesLevel ~ Sepal.Length + Petal.Length, data = iris2, family = binomial)
# Note: Petal.Length separates the two species perfectly, so the fit does not
# really converge and the standard errors reported below are huge
# Assess which predictor variables contribute to the predictive power of the model
# The greater the value, the more important the variable
caret::varImp(lg_iris2)
Overall
Sepal.Length 0.0002083317
Petal.Length 0.0006938295
# Extract the formula from lg_iris2
equatiomatic::extract_eq(lg_iris2, use_coefs = TRUE)
\[ \log\left[ \frac { \widehat{P( \operatorname{SpeciesLevel} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{SpeciesLevel} = \operatorname{1} )} } \right] = 65.84 - 37.56(\operatorname{Sepal.Length}) + 49.05(\operatorname{Petal.Length}) \]
# Show the statistical result
stargazer::stargazer(lg_iris2, type = 'text')
=============================================
Dependent variable:
---------------------------
SpeciesLevel
---------------------------------------------
Sepal.Length -37.561
(180,296.200)
Petal.Length 49.046
(70,689.070)
Constant 65.843
(744,553.200)
---------------------------------------------
Observations 100
Log Likelihood -0.000
Akaike Inf. Crit. 6.000
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
14.2.2.3 Visualization
# Fit the logistic regression curve on the scatter plot
ggplot(iris2, aes(x = Sepal.Length, y = SpeciesLevel)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = binomial)) +
  labs(x = 'Sepal length (cm)', y = 'Species and p')
14.3 Multinomial and ordinal logistic regression
14.3.1 Concept
Multinomial logistic regression:
Use a logistic function to model a nominal variable with multiple levels against one or more explanatory variables.
Ordinal logistic regression:
Use a logistic function to model an ordinal variable with multiple levels against one or more explanatory variables.
14.3.1.1 As a Model for Odds
\[\frac{P(Y_{i}=j|X_{i})}{P(Y_{i}=J|X_{i})}=\exp[\alpha_{j}+\beta_{j}x_{i}]\]
- \(j\): level of the dependent variable, \(j = 1, \dots, J\)
- \(P(Y_{i}=j|X_{i})\): The probability that individual \(i\) is in level \(j\) given a value of \(x_i\).
- The \(J\)-th level: the baseline.
14.3.1.2 As a Model of Probabilities
\[P(Y_{i}=j|X_{i})=\frac{\exp[\alpha_{j}+\beta_{j}x_{i}]}{\sum_{h=1}^{J}\exp[\alpha_{h}+\beta_{h}x_{i}]},\quad \text{where } j = 1,\dots,J\]
14.3.2 Example
14.3.2.1 Prepare the data
The iris dataset. Can we predict the Species according to the sepal and petal lengths and widths?
- Dependent variable: Species (a categorical variable with 3 levels)
- Independent variables: Sepal and petal lengths and widths (4 numerical variables)
Separate the dataset into training data (80%) and test data (20%) randomly:
set.seed(123)
train.sub <- sample(nrow(iris), round(0.8 * nrow(iris)))
train.data <- iris[train.sub, ]
test.data <- iris[-train.sub, ]
14.3.2.2 Fit the model
- Only two species appear in the coefficient table: setosa is the baseline level
library(nnet)
model <- multinom(Species ~ ., data = train.data)
# weights: 18 (10 variable)
initial value 131.833475
iter 10 value 12.430734
iter 20 value 2.945373
iter 30 value 2.027434
iter 40 value 1.883088
iter 50 value 1.835737
iter 60 value 1.697733
iter 70 value 1.351628
iter 80 value 0.944059
iter 90 value 0.776663
iter 100 value 0.737237
final value 0.737237
stopped after 100 iterations
summary(model)
Call:
multinom(formula = Species ~ ., data = train.data)
Coefficients:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 48.28722 -24.64201 -24.27893 47.2458 3.130761
virginica -83.41388 -58.94158 -60.89885 117.6548 65.130414
Std. Errors:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 106.5370 35.94035 27.03289 48.92239 32.11012
virginica 105.9997 46.24989 32.56462 41.91175 29.42979
Residual Deviance: 1.474475
AIC: 21.47447
\[O_{ve}=\frac{P(Y_i=versicolor)}{P(Y_i=setosa)} = exp(48.28722 - 24.64201L_s - 24.27893W_s + 47.2458L_p + 3.130761W_p)\] \[O_{vi} = \frac{P(Y_i=virginica)}{P(Y_i=setosa)} = exp(-83.41388 - 58.94158L_s - 60.89885W_s + 117.6548L_p + 65.130414W_p)\]
14.3.2.3 Validate the model
# Prediction by hand for the first test observation
o_ve <- exp(48.28722 - 24.64201 * test.data$Sepal.Length[1] - 24.27893 * test.data$Sepal.Width[1] + 47.2458 * test.data$Petal.Length[1] + 3.130761 * test.data$Petal.Width[1])
o_vi <- exp(-83.41388 - 58.94158 * test.data$Sepal.Length[1] - 60.89885 * test.data$Sepal.Width[1] + 117.6548 * test.data$Petal.Length[1] + 65.130414 * test.data$Petal.Width[1])
p_ve <- o_ve / (o_ve + o_vi + 1)
p_vi <- o_vi / (o_ve + o_vi + 1)
p_se <- 1 - p_ve - p_vi
# Check the prediction accuracy
test.data$predicted <- predict(model, newdata = test.data)
mean(test.data$predicted == test.data$Species) # accuracy rate
[1] 0.9666667
table(test.data$Species, test.data$predicted) # confusion matrix
setosa versicolor virginica
setosa 10 0 0
versicolor 0 14 1
virginica 0 0 5
# Visualize the distribution of the predicted probabilities
predicted.data1 <- data.frame(probability.of.Species = model$fitted.values, Species = train.data$Species)
predicted.data1$n <- 1:nrow(predicted.data1)
library(ggplot2)
ggplot(predicted.data1, aes(n, probability.of.Species.versicolor)) +
  geom_point(aes(color = Species)) +
  xlab("n") +
  ylab("Predicted probability of versicolor")
Wald test:
Determines the significance of the coefficients (odds ratios) in a logistic regression model and is used to calculate confidence intervals.
- \(H_0\): \(\beta = 0\) (the Wald statistic approximately follows a standard normal distribution) \[z_{score} =\frac{\beta}{s_e}\]
Z-score & P-value:
z_score <- summary(model)$coefficients / summary(model)$standard.errors
z_score
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 0.4532436 -0.6856363 -0.8981256 0.9657296 0.09750075
virginica -0.7869252 -1.2744157 -1.8700922 2.8072031 2.21307796
p <- pnorm(abs(z_score), lower.tail = FALSE) * 2
p
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 0.6503733 0.4929425 0.36911861 0.334179508 0.92232874
virginica 0.4313256 0.2025161 0.06147101 0.004997373 0.02689227
Confidence intervals:
confint(model) # One step
, , versicolor
2.5 % 97.5 %
(Intercept) -160.52147 257.09590
Sepal.Length -95.08380 45.79978
Sepal.Width -77.26243 28.70456
Petal.Length -48.64032 143.13193
Petal.Width -59.80392 66.06544
, , virginica
2.5 % 97.5 %
(Intercept) -291.16957 124.341812
Sepal.Length -149.58970 31.706534
Sepal.Width -124.72434 2.926639
Petal.Length 35.50928 199.800337
Petal.Width 7.44909 122.811738
14.3.2.4 Simplify the model
AIC filter:
# Evaluate and improve the model automatically (in this example we already have the best model)
model2 <- step(model)
summary(model2)
z_score <- summary(model2)$coefficients / summary(model2)$standard.errors
z_score
p <- pnorm(abs(z_score), lower.tail = FALSE) * 2
p
VIF filter: drop variables with VIF > 10 (equivalently \(R^2 > 0.9\))
# Test
iriswor <- train.data[, 1:4]
faraway::vif(iriswor)
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.946158 2.005345 29.045187 14.758382
# Filter
iriswor2 <- iriswor[, c('Sepal.Length', 'Sepal.Width')]
faraway::vif(iriswor2)
Sepal.Length Sepal.Width
1.012571 1.012571
# Update
model3 <- multinom(Species ~ Sepal.Length + Sepal.Width, data = train.data)
# weights: 12 (6 variable)
initial value 131.833475
iter 10 value 50.835156
iter 20 value 49.035250
iter 30 value 44.711778
iter 40 value 44.640621
iter 50 value 44.625237
iter 60 value 44.602328
iter 70 value 44.573805
iter 80 value 44.553198
iter 90 value 44.535668
iter 100 value 44.527089
final value 44.527089
stopped after 100 iterations
summary(model3)
Call:
multinom(formula = Species ~ Sepal.Length + Sepal.Width, data = train.data)
Coefficients:
(Intercept) Sepal.Length Sepal.Width
versicolor -67.69944 26.75297 -24.22665
virginica -79.78359 28.56638 -23.85885
Std. Errors:
(Intercept) Sepal.Length Sepal.Width
versicolor 59.21976 20.84157 18.96879
virginica 59.32647 20.84901 18.97688
Residual Deviance: 89.05418
AIC: 101.0542
z_score <- summary(model3)$coefficients / summary(model3)$standard.errors
z_score
(Intercept) Sepal.Length Sepal.Width
versicolor -1.143190 1.283635 -1.277185
virginica -1.344823 1.370155 -1.257259
p <- pnorm(abs(z_score), lower.tail = FALSE) * 2
p
(Intercept) Sepal.Length Sepal.Width
versicolor 0.2529597 0.1992697 0.2015371
virginica 0.1786824 0.1706384 0.2086598
14.3.2.5 Result
stargazer::stargazer(model, type = 'text')
==============================================
Dependent variable:
----------------------------
versicolor virginica
(1) (2)
----------------------------------------------
Sepal.Length -24.642 -58.942
(35.940) (46.250)
Sepal.Width -24.279 -60.899*
(27.033) (32.565)
Petal.Length 47.246 117.655***
(48.922) (41.912)
Petal.Width 3.131 65.130**
(32.110) (29.430)
Constant 48.287 -83.414
(106.537) (106.000)
----------------------------------------------
Akaike Inf. Crit. 21.474 21.474
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01
15 Poisson regression
15.1 Concept
15.1.1 Definition
Poisson distribution:
A discrete probability distribution describing the probability of a number of events occurring on a fixed spatial or temporal scale.
It is the limit of the binomial distribution as the number of trials grows infinite and the probability of each occurrence becomes infinitesimal.
PMF of the Poisson distribution (\(\lambda\) takes the role of \(\mu\) in the binomial distribution): \[P\left( X=k \right)=\frac{\lambda^{k}}{k!}e^{-\lambda}\]
- \(P\left( X=k \right)\) represents the probability of a random event occurring \(k\) times per unit time
- \(k=0,1,2,...\)
- \(\lambda >0\) (the expected value of x)
# Compare the Poisson distribution with the Normal distribution
for (l in 1:5) {
  curve(dpois(x, l),
        from = 0, to = 10, n = 11,
        add = l > 1, col = l,
        ylab = 'Density', lwd = 3)
}
legend('topright', legend = 1:5, title = bquote(lambda),
       lty = 1, col = 1:5, lwd = 3)
for (l in 1:5) {
  curve(dnorm(x, mean = l, sd = sqrt(l)),
        from = 0, to = 10, n = 110,
        col = l, lty = 2,
        add = TRUE)
}
15.1.2 Features of Poisson distribution
- \(\mu=\sigma^2=\lambda\)
- If \(Y_1 \sim \text{Poisson}(\lambda_1)\) and \(Y_2 \sim \text{Poisson}(\lambda_2)\), then \(Y = Y_1 + Y_2 \sim \text{Poisson}(\lambda = \lambda_1+\lambda_2)\) (see the quick simulation check after this list)
- With the increase of \(\lambda\), the center of the distribution moves to the right, and the Poisson distribution approaches the normal distribution.
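A quick simulation sketch of the additivity property (the sum of two independent Poisson variables with \(\lambda_1 = 2\) and \(\lambda_2 = 3\) should have mean and variance close to 5):
set.seed(42)
y <- rpois(1e5, 2) + rpois(1e5, 3)
c(mean(y), var(y)) # both close to lambda = 5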
15.1.3 Assumptions
- The occurrence of an event is completely random.
- The occurrence of an event is independent.
- The probability \(p\) of the occurrence of an event remains unchanged.
- The probability \(p\) of the event is low and the sample size is large.
15.1.4 Model
\[\log(Y)=a + b_1x_1 + b_2x_2 + \cdots + b_nx_n\]
- \(Y\): response variable
- \(a\) and \(b\): numerical coefficients
- \(x\): predictive variable
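A minimal sketch of the model on simulated data (the objects x, y and m here are hypothetical and only for illustration):
set.seed(1)
x <- runif(100, 0, 2)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x)) # true a = 0.5, b = 1.2
m <- glm(y ~ x, family = poisson())
coef(m) # estimates of a and b on the log scale
exp(coef(m)) # multiplicative effect of one unit of x on the expected count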
15.2 Workflow
15.2.1 Fit model
Count of the plant species ~ the geographic variables
library(faraway)
data(gala)
fit.gala <-
  glm(
    Species ~ Endemics + Area + Elevation + Nearest + Scruz + Adjacent,
    data = gala,
    family = poisson()
  )
summary(fit.gala)
Call:
glm(formula = Species ~ Endemics + Area + Elevation + Nearest +
Scruz + Adjacent, family = poisson(), data = gala)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.9919 -2.9305 -0.4296 1.3254 7.4735
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.828e+00 5.958e-02 47.471 < 2e-16 ***
Endemics 3.388e-02 1.741e-03 19.459 < 2e-16 ***
Area -1.067e-04 3.741e-05 -2.853 0.00433 **
Elevation 2.638e-04 1.934e-04 1.364 0.17264
Nearest 1.048e-02 1.611e-03 6.502 7.91e-11 ***
Scruz -6.835e-04 5.802e-04 -1.178 0.23877
Adjacent 4.539e-05 4.800e-05 0.946 0.34437
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3510.73 on 29 degrees of freedom
Residual deviance: 313.36 on 23 degrees of freedom
AIC: 488.19
Number of Fisher Scoring iterations: 5
15.2.2 Simplify the model
As we can see in the previous result, the significant variables are Endemics, Area and Nearest. So update the model:
fit.gala.reduced <-
  glm(Species ~ Endemics + Area + Nearest,
      data = gala,
      family = poisson())
summary(fit.gala.reduced)
Call:
glm(formula = Species ~ Endemics + Area + Nearest, family = poisson(),
data = gala)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.085 -3.061 -0.593 2.095 6.770
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.868e+00 4.883e-02 58.729 < 2e-16 ***
Endemics 3.551e-02 7.327e-04 48.462 < 2e-16 ***
Area -4.542e-05 1.568e-05 -2.896 0.00378 **
Nearest 9.289e-03 1.319e-03 7.043 1.88e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3510.73 on 29 degrees of freedom
Residual deviance: 330.84 on 26 degrees of freedom
AIC: 499.67
Number of Fisher Scoring iterations: 5
exp(coef(fit.gala.reduced)) # The effect of each variable on the number of species
(Intercept) Endemics Area Nearest
17.5995818 1.0361458 0.9999546 1.0093319
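Because the model is linear on the log scale, each coefficient acts multiplicatively on the expected count: each additional endemic species multiplies the expected number of species by about 1.036 (a 3.6% increase), while one extra unit of Area multiplies it by about 0.99995 (essentially no change).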
15.2.3 Overdispersion test
library(qcc)
qcc.overdispersion.test(gala$Species, type = "poisson")
Overdispersion test Obs.Var/Theor.Var Statistic p-value
poisson data 154.1737 4471.037 0
The p-value is smaller than 0.05, so the data show overdispersion (the variance is greater than the mean). In this case the Poisson model no longer applies, because it assumes that the variance equals the mean. We can use the quasipoisson() family to handle it:
fit.gala.reduced.od <-
  glm(Species ~ Endemics + Area + Nearest,
      data = gala,
      family = quasipoisson())
summary(fit.gala.reduced.od)
Call:
glm(formula = Species ~ Endemics + Area + Nearest, family = quasipoisson(),
data = gala)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.085 -3.061 -0.593 2.095 6.770
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.868e+00 1.672e-01 17.152 1.08e-15 ***
Endemics 3.551e-02 2.509e-03 14.153 9.96e-14 ***
Area -4.542e-05 5.370e-05 -0.846 0.4054
Nearest 9.289e-03 4.516e-03 2.057 0.0499 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 11.72483)
Null deviance: 3510.73 on 29 degrees of freedom
Residual deviance: 330.84 on 26 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
15.2.4 Result
Area is not significant, while Endemics and Nearest are significant.
# Equation
equatiomatic::extract_eq(fit.gala.reduced.od, use_coefs = TRUE)
\[ \log ({ \widehat{E( \operatorname{Species} )} }) = 2.87 + 0.04(\operatorname{Endemics}) + 0(\operatorname{Area}) + 0.01(\operatorname{Nearest}) \]
# Analysis
stargazer::stargazer(fit.gala.reduced.od, type = 'text')
========================================
Dependent variable:
---------------------------
Species
----------------------------------------
Endemics 0.036***
(0.003)
Area -0.00005
(0.0001)
Nearest 0.009**
(0.005)
Constant 2.868***
(0.167)
----------------------------------------
Observations 30
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
16 Non-linear regression
16.1 Concepts
16.1.1 Coefficients of linear regression (preview)
- SLR: \(y = b_0 + b_1x\)
- MLR: \(y = b_0 + b_1x_1 + b_2x_2\)
- Logistic Regression: \(\mathrm {log} \frac{y}{1-y} = \beta_0 + \beta_1 x\) or \(y = \frac{e^{\beta_0 + \beta_1x}}{\ 1 + e^{\beta_0 + \beta_1x}}\)
- Poisson Regression: \(\mathrm {log}(y)=a + bx\) or \(y = e^{a + bx}\)
16.1.2 Non-linear Regression definition
\[y_{i}=f(x_{i}, \theta)+\varepsilon_{i}, i=1,2,...,n\]
- \(y_{i}\): The response variable
- \(x_{i}=(x_{i1},x_{i2},...,x_{ik})'\): The explanatory variable
- \(\theta=(\theta_{0},\theta_{1},...\theta_{p})'\): Unknown parameter vector
- \(\varepsilon_{i}\): Random error term
A model in which the relationship between the dependent variable(s) and the independent variable(s) is not a straight line but a curve.
Two categories:
- One that can be transformed into a linear model
- One that cannot be transformed into a linear model
Like:
- \(y = a + be^x\): Set \(x' = e^x\), then \(y = a + bx'\)
- \(z=a+be^{xy}\): Can’t be transformed into linear
Some Non-linear Regression model:
Name | Equation |
---|---|
Michaelis-Menten | \(y=\frac{a x}{b+cx}\) or \(y=\frac{a x}{b+x}\)… |
3-parameter asymptotic exponential | \(y=a-b \mathrm{e}^{-c x}\) |
3-parameter logistic | \(y=\frac{a}{1+b \mathrm{e}^{-c x}}\) |
Biexponential | \(y=a \mathrm{e}^{b x}-c \mathrm{e}^{-d x}\) |
… | … |
16.1.3 Visualization
# Parameters
a <- 1
b <- 2
c <- 3
d <- 4
# curve() function way
par(mfrow = c(2,2))
curve(a * x / (b + c * x), -10, 10)
curve(a - b * exp(-c * x), -10, 10)
curve(a / (1 + b * exp(-c * x)), -10, 10)
curve(a * exp(b * x) - c * exp(-d * x), -10, 10)
# ggplot() function way
library(ggplot2)
library(gridExtra)
plot1 <- ggplot(data.frame(x = seq(-10, 10, length.out = 100)), aes(x = x)) +
  geom_line(aes(y = a * x / (b + c * x))) +
  ggtitle("Michaelis-Menten") +
  ylab(expression(y == frac(a * x, b + c * x))) +
  theme_classic()
plot2 <- ggplot(data.frame(x = seq(-10, 10, length.out = 100)), aes(x = x)) +
  geom_line(aes(y = a - b * exp(-c * x))) +
  ggtitle("3-parameter asymptotic exponential") +
  ylab(expression(y == a - b * e^(-c * x))) +
  theme_classic()
plot3 <- ggplot(data.frame(x = seq(-10, 10, length.out = 100)), aes(x = x)) +
  geom_line(aes(y = a / (1 + b * exp(-c * x)))) +
  ggtitle("3-parameter logistic") +
  ylab(expression(y == frac(a, 1 + b * e^(-c * x)))) +
  theme_classic()
plot4 <- ggplot(data.frame(x = seq(-10, 10, length.out = 100)), aes(x = x)) +
  geom_line(aes(y = a * exp(b * x) - c * exp(-d * x))) +
  ggtitle("Biexponential") +
  ylab(expression(y == a * e^(b * x) - c * e^(-d * x))) +
  theme_classic()
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)
16.1.4 Coefficient
Coefficient of determination (\(R^2\)) formula:
\[R^2=\frac{SSR}{SST}=\frac{\sum_{}^{}(y'-\bar{y})^2}{\sum_{}^{}(y-\bar{y})^2}\]
Pearson Correlation Coefficient (\(r\)) formula:
\[r =\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}}\]
16.1.5 Partial coefficient of determination
For linear regression, \(r^2 = R^2\). But for non-linear regression, \(SST \neq SSE + SSR\), so \(R^2 \neq r^2\).
The partial coefficient of determination measures the fit of a non-linear regression model. It is defined as:
\[R^2=1-\frac{SSE}{SST}\]
16.2 Workflow
16.2.1 Original data
Reaction rate ~ concentration
<- read.table("data/mm.txt", header=T)
dtf library(ggplot2)
ggplot(dtf) + geom_point(aes(conc, rate)) + theme_bw()
16.2.2 Fit the model manually
Fit the Michaelis-Menten model (m1): \(y=\frac{a x}{b+x}\)
Features:
- The curve passes through the origin
- Rising rate of the curve diminishes with the increasing of \(x\)
- Function has an asymptote \(y = a\)
- \(y = \frac{a}{2}\) when \(x = b\)
Fit the 3-parameter asymptotic exponential model (m2): \(y=a-b \mathrm{e}^{-c x}\)
m1 <- nls(rate ~ a * conc / (b + conc), data = dtf, start = list(a = 200, b = 0.03))
m2 <- nls(rate ~ a - b * exp(-c * conc), data = dtf, start = list(a = 200, b = 150, c = 5))
16.2.3 Automatic way
Finding good starting values for the model parameters is hard, so R provides self-starting functions:
Function | Model |
---|---|
SSasymp() | asymptotic regression model |
SSasympOff() | asymptotic regression model with an offset |
SSasympOrig() | asymptotic regression model through the origin |
SSbiexp() | biexponential model |
SSfol() | first-order compartment model |
SSfpl() | four-parameter logistic model |
SSgompertz() | Gompertz growth model |
SSlogis() | logistic model |
SSmicmen() | Michaelis–Menten model |
SSweibull() | Weibull growth curve model |
m1 <- nls(rate ~ SSmicmen(conc, a, b), data = dtf)
m1
Nonlinear regression model
model: rate ~ SSmicmen(conc, a, b)
data: dtf
a b
212.68371 0.06412
residual sum-of-squares: 1195
Number of iterations to convergence: 0
Achieved convergence tolerance: 1.937e-06
Result: \(r = \frac{212.7c}{0.06412+c}\)
summary(m1)
Formula: rate ~ SSmicmen(conc, a, b)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 2.127e+02 6.947e+00 30.615 3.24e-11 ***
b 6.412e-02 8.281e-03 7.743 1.57e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.93 on 10 degrees of freedom
Number of iterations to convergence: 0
Achieved convergence tolerance: 1.937e-06
m2 <- nls(rate ~ SSasympOff(conc, a, b, c), data = dtf)
m2
Nonlinear regression model
model: rate ~ SSasympOff(conc, a, b, c)
data: dtf
a b c
200.94280 1.85368 -0.04298
residual sum-of-squares: 1009
Number of iterations to convergence: 0
Achieved convergence tolerance: 1.768e-06
Result: \(r = 201 - 153 e^{-6.38c}\)
summary(m2)
Formula: rate ~ SSasympOff(conc, a, b, c)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 200.94280 5.97460 33.633 8.95e-11 ***
b 1.85368 0.16218 11.430 1.16e-06 ***
c -0.04298 0.01486 -2.893 0.0178 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.59 on 9 degrees of freedom
Number of iterations to convergence: 0
Achieved convergence tolerance: 1.768e-06
16.2.4 Partial coefficient of determination
Formula: \(R^{2}=1-\frac{SSE}{SST}\)
calc_R2 <- function(model) {
  ms <- summary(model)
  sse <- as.vector((ms[[3]])^2 * ms$df[2]) # residual SE^2 * residual df = SSE
  null <- lm(dtf$rate ~ 1)
  sst <- as.vector(unlist(summary.aov(null)[[1]][2])) # total sum of squares
  1 - sse / sst
}
c(calc_R2(m1), calc_R2(m2))
[1] 0.9612608 0.9672987
16.2.5 Visualization
xv <- seq(0, 1.2, .01)
y1 <- predict(m1, list(conc = xv))
y2 <- predict(m2, list(conc = xv))
dtf_predicted <- data.frame(xv, y1, y2)
ggplot(dtf) +
  geom_point(aes(conc, rate)) +
  geom_line(aes(xv, y1, color = 'Michaelis-Menten model'), data = dtf_predicted) +
  geom_line(aes(xv, y2, color = '3-parameter asymptotic exponential model'), data = dtf_predicted) +
  theme_bw() +
  scale_color_manual(values = c('Michaelis-Menten model' = 'blue', '3-parameter asymptotic exponential model' = 'red')) +
  labs(color = 'Model') +
  guides(color = guide_legend(title = 'Models'))
17 Tests for distributions
17.1 Distributions and their R functions
Distribution | p | q | d | r |
---|---|---|---|---|
Beta | pbeta() | qbeta() | dbeta() | rbeta() |
Binomial | pbinom() | qbinom() | dbinom() | rbinom() |
Cauchy | pcauchy() | qcauchy() | dcauchy() | rcauchy() |
Chi-Square | pchisq() | qchisq() | dchisq() | rchisq() |
Exponential | pexp() | qexp() | dexp() | rexp() |
F | pf() | qf() | df() | rf() |
Gamma | pgamma() | qgamma() | dgamma() | rgamma() |
Geometric | pgeom() | qgeom() | dgeom() | rgeom() |
Hypergeometric | phyper() | qhyper() | dhyper() | rhyper() |
Logistic | plogis() | qlogis() | dlogis() | rlogis() |
Log Normal | plnorm() | qlnorm() | dlnorm() | rlnorm() |
Negative Binomial | pnbinom() | qnbinom() | dnbinom() | rnbinom() |
Normal | pnorm() | qnorm() | dnorm() | rnorm() |
Poisson | ppois() | qpois() | dpois() | rpois() |
Student t | pt() | qt() | dt() | rt() |
Studentized Range | ptukey() | qtukey() | dtukey() | rtukey() |
Uniform | punif() | qunif() | dunif() | runif() |
Weibull | pweibull() | qweibull() | dweibull() | rweibull() |
Wilcoxon Rank Sum Statistic | pwilcox() | qwilcox() | dwilcox() | rwilcox() |
Wilcoxon Signed Rank Statistic | psignrank() | qsignrank() | dsignrank() | rsignrank() |
Distribution functions in R: prefix + root_name
- d for “density”, the density function (PDF).
- p for “probability”, the cumulative distribution function (CDF).
- q for “quantile” or “critical”, the inverse of CDF.
- r for “random”, a random variable having the specified distribution
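A one-line illustration of each prefix, using the normal distribution:
dnorm(0) # density (PDF) at x = 0
pnorm(1.96) # CDF: P(X <= 1.96), about 0.975
qnorm(0.975) # inverse CDF (quantile), about 1.96
rnorm(3) # three random draws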
Some common distribution:
Distribution | Parameters | Expression | Mean | Variance |
---|---|---|---|---|
Bernoulli trial | \(p\) | \(q\) | \(p\) | \(pq\) |
Binomial | \(n, p\) | \(C(n, x) p^x q^{n-x}\) | \(np\) | \(npq\) |
Poisson | \(\lambda\) | \(e^{-\lambda} \lambda^x / x !\) | \(\lambda\) | \(\lambda\) |
Normal | \(\mu, \sigma\) | \(\frac{1}{\sqrt{2 \pi \sigma}} e^{-(x-\mu)^2 / 2 \sigma^2}\) | \(\mu\) | \(\sigma^2\) |
Std. Normal | \(z\) | \(\frac{1}{\sqrt{2 \pi}} e^{-z^2 / 2}\) | \(0\) | \(1\) |
Student-t | \(v\) | \(\frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v \pi} \Gamma\left(\frac{v}{2}\right)}\left(1+\frac{x^2}{v}\right)^{-\frac{v+1}{2}}\) | \(0\) | \(\left\{\begin{array}{cc}v /(v-2) & \text { for } v>2 \\ \infty & \text { for } 1<v \leq 2 \\ \text { undefined } & \text { otherwise }\end{array}\right.\) |
17.2 Empirical distribution function (EDF or eCDF)
17.2.1 Concepts
\[F_{n}(x)= \begin{cases}0, & \text { for } x<y_{1} \\ k / n, & \text { for } y_{k} \leqslant x<y_{k+1}\qquad (k=1,2, \ldots, n-1) \\ 1, & \text { for } x \geqslant y_{n}\end{cases}\]
Empirical distribution function (EDF) is a way of representing the cumulative distribution function (CDF) of a set of data. The EDF is calculated by ordering the data from smallest to largest, and then assigning a probability to each data point based on its rank in the distribution.
Specifically, the EDF assigns a probability of 1/n to each of the n data points, where n is the sample size. Then, the EDF at any given point x is the proportion of data points that are less than or equal to x.
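A minimal sketch (with a made-up vector v) showing that ecdf() returns exactly this proportion:
v <- c(3, 1, 4, 1, 5)
Fn <- ecdf(v)
Fn(4) # 0.8: four of the five points are <= 4
mean(v <= 4) # the same proportion computed by hand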
17.2.2 Visualization
par(mfrow = c(2, 2))
x1 <- rnorm(10)
x2 <- rnorm(20)
x3 <- rnorm(50)
x4 <- rnorm(100)
plot(ecdf(x1))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x2))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x3))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x4))
curve(pnorm, col = 'red', add = TRUE)
17.3 Kolmogorov-Smirnov test
17.3.1 Concepts
The KS test works by comparing the maximum distance between the EDF and the CDF, called the KS statistic, to a critical value calculated from the null distribution of the test statistic. If the calculated KS statistic is larger than the critical value, the null hypothesis that the sample is drawn from the theoretical distribution is rejected.
Critical values for \(D\) for one sample: exact values are tabulated for \(n\le40\); the asymptotic critical values are:
\(\alpha\) | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 |
---|---|---|---|---|---|
Critical \(D\) | \(1.07\sqrt{\frac{1}{n}}\) | \(1.14\cdot\sqrt{\frac{1}{n}}\) | \(1.22\cdot\sqrt{\frac{1}{n}}\) | \(1.52\cdot\sqrt{\frac{1}{n}}\) | \(1.63 \cdot\sqrt{\frac{1}{n}}\) |
Critical values for \(D\) for two samples of sizes \(n\) and \(m\) (large sample sizes), asymptotic:
\(\alpha\) | 0.2 | 0.1 | 0.05 | 0.02 | 0.01 |
---|---|---|---|---|---|
Critical \(D\) | \(1.07\cdot\sqrt{\frac{m+n}{mn}}\) | \(1.14\cdot\sqrt{\frac{m+n}{mn}}\) | \(1.22\cdot\sqrt{\frac{m+n}{mn}}\) | \(1.52\cdot\sqrt{\frac{m+n}{mn}}\) | \(1.63 \cdot\sqrt{\frac{m+n}{mn}}\) |
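A sketch of how these critical values are computed (the sizes n and m here are illustrative):
n <- 10; m <- 8
c_alpha <- c(1.07, 1.14, 1.22, 1.52, 1.63) # alpha = 0.2, 0.1, 0.05, 0.02, 0.01
c_alpha * sqrt(1 / n) # one-sample critical D
c_alpha * sqrt((m + n) / (m * n)) # two-sample critical D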
17.3.2 Visualization
# package 'kolmim' may not be available for recent versions of R
library(kolmim)
pvalue <- pkolm(d = 0.1, n = 30)
nks <- c(10, 20, 30, 50, 80, 100)
x <- seq(0.01, 1, 0.01)
y <- sapply(x, pkolm, n = nks[1])
plot(x, y, type = 'l', las = 1, xlab = 'D', ylab = 'CDF')
for (i in 2:length(nks)) lines(x, sapply(x, pkolm, n = nks[i]), col = i)
17.3.3 One-sample test
Hypothesis:
\(H_{0}: F(x)=F_{0}(x)\) The CDF of x is a given reference distribution function \(F_0(x)\).
Test statistic:
\(D_{n}=\sup _{x}\left[\left|F_{n}(x)-F_{0}(x)\right|\right]\)
\(D_{n}\): the supremum distance between the empirical distribution function of given sample and the cumulative distribution function of the given reference distribution.
Example:
# Original dataset
x0 <- c(108, 112, 117, 130, 111, 131, 113, 113, 105, 128)
x0_mean <- mean(x0)
x0_sd <- sd(x0)
eCDF <- ecdf(x0)
# cumulative distribution function F(x) of the reference distribution
CDF <- pnorm(x0, x0_mean, x0_sd)
# create a data frame to put values into
df <- data.frame(data = x0, eCDF = eCDF(x0), CDF = CDF)
# visualization
library(ggplot2)
ggplot(df, aes(data)) +
  stat_ecdf(size = 1, aes(colour = "Empirical CDF (Fn(x))")) +
  stat_function(fun = pnorm, args = list(x0_mean, x0_sd), aes(colour = "Theoretical CDF (F(x))")) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2)) +
  theme(legend.title = element_blank())
# sort values of sample observations and remove duplicates
x <- unique(sort(x0))
# Calculate D at each jump (just after the step)
Daft <- abs(eCDF(x) - pnorm(x, x0_mean, x0_sd))
Daft
[1] 0.005874162 0.024147672 0.030327372 0.094264640 0.256212096 0.191556764
[7] 0.082045735 0.018782968 0.066450468
# Calculate D just before each step
Dbef <- abs(c(0, eCDF(x)[-length(x)]) - pnorm(x, x0_mean, x0_sd))
Dbef
[1] 0.10587416 0.07585233 0.06967263 0.00573536 0.05621210 0.09155676 0.18204573
[8] 0.11878297 0.03354953
D_score <- max(c(Daft, Dbef))
D_score
[1] 0.2562121
One step in R:
ks.test(x0,'pnorm', x0_mean, x0_sd)
Asymptotic one-sample Kolmogorov-Smirnov test
data: x0
D = 0.25621, p-value = 0.5276
alternative hypothesis: two-sided
Decision:
We cannot reject \(H_0\) that the data are normally distributed at \(\alpha = 0.05\).
Conclusion:
The data are consistent with a normal distribution.
17.3.4 Two-sample test
Hypothesis:
\(H_0\): Two populations follow a common probability distribution.
Test statistic:
\(D_{n,m}=\sup _{x}\left[\left|F_{1, n}(x)-F_{2, m}(x)\right|\right]\)
Example:
# Original dataset
sample1 <- c(165, 168, 172, 177, 180, 182)
sample2 <- c(163, 167, 169, 175, 175, 179, 183, 185)
mean1 <- mean(sample1)
mean2 <- mean(sample2)
sd1 <- sd(sample1)
sd2 <- sd(sample2)
# empirical distribution function Fn(x)
eCDF1 <- ecdf(sample1)
eCDF1(sample1)
[1] 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 1.0000000
# empirical distribution function Fm(x)
eCDF2 <- ecdf(sample2)
eCDF2(sample2)
[1] 0.125 0.250 0.375 0.625 0.625 0.750 0.875 1.000
# visualization
<- c(rep("sample1", length(sample1)), rep("sample2", length(sample2)))
group2 <- data.frame(all = c(sample1,sample2), group = group2)
df2 library(ggplot2)
ggplot(df2, aes(x = all, group = group, color = group2)) +
stat_ecdf(size=1) +
xlab("Sample data") +
ylab("Cumulative probability") +
theme(legend.title=element_blank())
# merge, sort observations of two samples, and remove duplicates
x2 <- unique(sort(c(sample1, sample2)))
# calculate D and its location
D2 <- max(abs(eCDF1(x2) - eCDF2(x2)))
idxD2 <- which.max(abs(eCDF1(x2) - eCDF2(x2))) # index of the x-axis value
xD2 <- x2[idxD2] # corresponding x-axis value
D2
[1] 0.25
One step in R:
ks.test(sample1, sample2)
Exact two-sample Kolmogorov-Smirnov test
data: sample1 and sample2
D = 0.25, p-value = 0.9281
alternative hypothesis: two-sided
Decision:
We cannot reject the \(H_0\) at \(\alpha = 0.05\).
Conclusion:
We can conclude that sample 1 and sample 2 come from the same distribution.
18 Principal component analysis
18.1 Concepts
18.1.1 Definition
A dimensionality reduction technique that identifies correlations and patterns in a data set so that it can be transformed into a data set of much lower dimension without losing any important information.
18.1.2 Procedure
Move the center of the coordinate axis to the center of the data, and then rotate the coordinate axis to maximize the variance of the data projected on the C1 axis. C1 is called “the first principal component”. That is, the projection of all the data in this direction is the most scattered.
Find an axis (C2), so that the covariance (correlation coefficient) between C2 and C1 is 0, so as to avoid overlapping with C1 information. Make the variance of data in this direction as large as possible. C2 is called “the second principal component”.
Similarly, find the third principal component, the fourth principal component, and so on up to the \(n\)-th principal component: \(n\) random variables can have at most \(n\) principal components.
18.1.3 Usages in R
Two functions:
- prcomp()
- princomp()
Example:
com1 <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(com1)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
dt <- as.matrix(scale(iris[, 1:4])) # standardize the data first
com2 <- princomp(dt, cor = TRUE)
summary(com2)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.7083611 0.9560494 0.38308860 0.143926497
Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
Cumulative Proportion 0.7296245 0.9581321 0.99482129 1.000000000
Warning: only applies to matrices with more rows than columns.
18.2 Data transformation
Normalization | Standardization |
---|---|
\(x_{norm} = \frac{x -x_{min}}{x_{max} - x_{min}}\) | \(z = \frac{x-\mu}{\sigma}\) |
Minimum and maximum value of features are used for scaling | Mean and standard deviation are used for scaling |
When features are of different scales | When we want to ensure zero mean and unit standard deviation |
0~1 | Not bounded to a certain range. |
Much affected by outliers | Less affected by outliers |
It is useful when we don’t know about the distribution | It is useful when the feature distribution is Normal or Gaussian |
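A minimal sketch contrasting the two transformations on Sepal.Length:
x <- iris$Sepal.Length
x_norm <- (x - min(x)) / (max(x) - min(x)) # normalization: bounded to 0~1
z <- (x - mean(x)) / sd(x) # standardization: mean 0, sd 1
c(range(x_norm), mean(z), sd(z))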
18.3 Workflow
18.3.1 Data preprocess
# Iris example
dt <- as.matrix(scale(iris[, 1:4]))
head(dt, 4)
Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] -0.8976739 1.01560199 -1.335752 -1.311052
[2,] -1.1392005 -0.13153881 -1.335752 -1.311052
[3,] -1.3807271 0.32731751 -1.392399 -1.311052
[4,] -1.5014904 0.09788935 -1.279104 -1.311052
18.3.2 Correlation coefficient (covariance) matrix
rm1 <- cor(dt)
rm1
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
18.3.3 Eigenvalues and eigenvectors
rs1 <- eigen(rm1)
# Extract the eigenvalues, i.e. the variance of each principal component
val <- rs1$values
# Convert to standard deviations
Standard_deviation <- sqrt(val)
# Calculate the variance contribution rate and the cumulative contribution rate
Proportion_of_Variance <- val / sum(val)
Cumulative_Proportion <- cumsum(Proportion_of_Variance)
# Draw the scree plot
plot(val, type = "b", las = 1,
     xlab = "Principal component number", ylab = "Principal component variance")
Then choose the appropriate number of principal components from the scree plot.
18.3.4 Principal component score
# Eigenvectors from the decomposition
U <- as.matrix(rs1$vectors)
# Matrix multiplication to obtain the PC scores
PC <- dt %*% U
colnames(PC) <- c("PC1", "PC2", "PC3", "PC4")
df <- data.frame(PC, iris$Species)
head(df, 4)
PC1 PC2 PC3 PC4 iris.Species
1 -2.257141 -0.4784238 0.1272796 0.02408751 setosa
2 -2.074013 0.6718827 0.2338255 0.10266284 setosa
3 -2.356335 0.3407664 -0.0440539 0.02828231 setosa
4 -2.291707 0.5953999 -0.0909853 -0.06573534 setosa
18.3.5 One step way
com <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
df <- data.frame(com$x, iris$Species)
18.3.6 Visualization
Draw a scatter plot of the principal components:
library(ggplot2)
# Extract the variance contribution rates of the principal components and build the axis titles
xlab <- paste0("PC1 (", round(Proportion_of_Variance[1] * 100, 2), "%)")
ylab <- paste0("PC2 (", round(Proportion_of_Variance[2] * 100, 2), "%)")
# Draw a scatter plot and add confidence ellipses
ggplot(data = df, aes(x = PC1, y = PC2, color = iris.Species)) +
  geom_point() + labs(x = xlab, y = ylab) +
  stat_ellipse(
    aes(fill = iris.Species),
    type = "norm",
    geom = "polygon",
    alpha = 0.2,
    color = NA
  )
19 Cluster analysis
19.1 Concepts
19.1.1 Definition
Cluster analysis/Clustering: A simple and intuitive method of data simplification. It splits a number of observations or measured variables into groups that are similar in their characteristics or behaviors.
Cluster: A group of objects such that the objects in the cluster are closer (more similar) to the ‘centre’ of their cluster than to the centre of any other cluster. The centre of a cluster is usually the centroid, the average of all the points in the cluster, or the most ‘representative’ point in the cluster.
19.1.2 K-means
Centroid models: Iterative clustering algorithms. Iteratively correct the centroid positions of the clusters and adjust the classification of each data point.
K-means: Classify a given data set into a certain number (K) of predefined clusters. The most common centroid-based clustering algorithm. It finds the partition of the \(n\) individuals into \(k\) groups that minimises the within-group sum of squares over all variables.
Steps:
- Specify the number of clusters k.
- Assign points to clusters randomly.
- Find the centroids of each cluster.
- Re-assign points according to their closest centroid.
- Re-adjust the positions of the cluster centroids.
- Repeat steps 4 and 5 until no further changes occur.
Advantages:
- The fastest centroid-based algorithm.
- Work for large data sets.
- Reduce intra-cluster variance
Disadvantages:
- Performance degrades when there is more noise in the data.
- Outliers pull the centroids and cannot be identified as such.
- Even though it reduces intra-cluster variance, it does not guarantee reaching the global minimum of the variance measure.
- Very sensitive to data sets with non-convex cluster shapes.
The number of possible partitions grows explosively with the sample size \(n\) and the number of clusters \(k\), so a complete enumeration of every possible partition would be impossible, even with the fastest computers.
19.1.3 Hierarchical cluster analysis (HCA)
HCA:
Suitable for clustering small data sets because of its cost in time and space.
Approaches:
Agglomerative (AGNES): Begin with each observation in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied. Bottom-up.
Divisive (DIANA): Begin with all patterns in a single cluster and performs splitting until a stopping criterion is met. Top-down.
Advantages:
- The similarity and distance rules are easy to define and less restrictive.
- There is no need to specify the number of clusters in advance.
- It can discover the hierarchy of clusters, shown as a tree (dendrogram).
Disadvantages:
- The algorithm is complex.
- It does not scale well, because of the difficulty of choosing merging or splitting points.
19.2 Workflow
19.2.1 Data preprocess
- Same unit: the physical meanings are ignored and only the numbers matter.
- Same dimension: because distances are measured across variables.
newIris <- iris[1:4]
head(newIris, 4)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
19.2.2 K-means
Compare the Total Within Sum of Square under different k values, then determine the optimal clustering number:
library(factoextra)
fviz_nbclust(newIris, kmeans, method = "wss")
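The same curve can be computed by hand; a minimal sketch of what fviz_nbclust() plots:
wss <- sapply(1:8, function(k) kmeans(newIris, k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within sum of squares")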
Set k = 3, clustering:
kc <- kmeans(newIris, 3)
Create a cross-tabulation of the clusters against the original species to check the results:
table(iris$Species,kc$cluster)
1 2 3
setosa 17 33 0
versicolor 4 0 46
virginica 0 0 50
Visualization:
par(mfrow = c(1, 2))
plot(newIris[c("Sepal.Length", "Sepal.Width")],
col = kc$cluster,
pch = as.integer(iris$Species))
points(kc$centers,
col = 1:3,
pch = 16,
cex = 2)
plot(newIris[c("Petal.Length", "Petal.Width")],
col = kc$cluster,
pch = as.integer(iris$Species))
points(kc$centers,
col = 1:3,
pch = 16,
cex = 2)
19.2.3 Hierarchical cluster analysis
Box-Cox transformation:explain
A flexible data normalization method which can minimize the skewness. More suitable for the (applied geochemical and) general environmental data. The data for Box-Cox transformation is required to be positive and continuous variables.19.2.3.1 First 3 colomns iris dataset
iris3R <- as.matrix(iris[2:4, -5])
row.names(iris3R) <- c("Object1", "Object2", "Object3")
library(forecast)
iris.transformed <- BoxCox(iris3R, lambda = "auto")
iris.transformed
Sepal.Length Sepal.Width Petal.Length Petal.Width
Object1 3.725276 1.941234 0.3967354 -0.8225575
Object2 3.539252 2.131053 0.2981108 -0.8225575
Object3 3.446104 2.036214 0.4950348 -0.8225575
attr(,"lambda")
[1] 0.9538139
19.2.3.2 The euclidean method
dist() uses "euclidean" as the default distance measure if no method is chosen. Euclidean distance is a good choice for low-dimensional data, but the data usually need to be normalized first.
# calculate the distance matrix
<- dist(as.matrix(iris.transformed), method = "euclidean")
distE distE
Object1 Object2
Object2 0.2834833
Object3 0.3108387 0.2375921
The distance between Object2 and Object3 is minimum.
Perform HCA (AGNES):
hc1 <- hclust(distE, method = "complete")
# hclust() uses "complete" linkage by default if no method is chosen
# Draw the dendrogram (the cluster solutions grow tree-like)
plot(hc1)
Result: Object2 and Object3 are more similar, and Object1 is more different.
19.2.3.3 The canberra method
If we change the distance measure to "canberra", the situation is slightly different:
<- dist(as.matrix(iris.transformed), method = "canberra")
distC distC
Object1 Object2
Object2 0.2141568
Object3 0.1730378 0.2843751
hc2 <- hclust(distC, method = "complete")
plot(hc2)
19.2.3.4 Decision
Use agglomerative coefficients (AC, for AGNES) and divisive coefficient (DC, for DIANA) to compare the strength of the hierarchical clustering structures. The stronger the clustering structure, the higher the coefficient (closer to 1).
library(cluster)
# help("coef.hclust")
coef.hclust(hc1)
[1] 0.1570945
coef.hclust(hc2)
[1] 0.2610104
Result: as "canberra" gives a higher agglomerative coefficient, it produces the stronger clustering structure here.
19.2.3.5 Whole iris dataset clustering
library(cluster)
library(forecast)
iris.trans <- BoxCox(as.matrix(iris[, -5]), lambda = "auto")
Perform AGNES:
irisHcA <- agnes(iris.trans, metric = "euclidean", method = "average") # `metric = "euclidean"` is the default setting
# Cut the AGNES tree at cluster number (k) = 3
hclusterA <- cutree(irisHcA, k = 3)
tb1 <- table(true = iris$Species, cluster = hclusterA)
tb1
cluster
true 1 2 3
setosa 50 0 0
versicolor 0 46 4
virginica 0 50 0
# Extract the agglomerative coefficient; agnes()/diana() calculate it automatically
irisHcA$ac # the k value does not affect the agglomerative coefficient
[1] 0.9378719
<- agnes(iris.trans, method = "single")
irisHcS <- cutree(irisHcS, k = 3)
hclusterS <- table(true = iris$Species, cluster = hclusterS)
tb2 tb2
cluster
true 1 2 3
setosa 49 1 0
versicolor 0 0 50
virginica 0 0 50
irisHcS$ac
[1] 0.8807232
Perform DIANA:
irisHcD <- diana(iris.trans) # no linkage method; default settings
# Cut at k = 3
hclusterD <- cutree(irisHcD, k = 3)
tb3 <- table(true = iris$Species, cluster = hclusterD)
tb3
cluster
true 1 2 3
setosa 50 0 0
versicolor 0 46 4
virginica 0 3 47
# Extract divisive coefficient
irisHcD$dc
[1] 0.9555297
Result: in this example, DIANA (no linkage method, divisive coefficient 0.956) performs better than the average-linkage and single-linkage AGNES methods.
SessionInfo:
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8
[2] LC_CTYPE=Chinese (Simplified)_China.utf8
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] cluster_2.1.4 forecast_8.21 factoextra_1.0.7 gridExtra_2.3
[5] qcc_2.7 faraway_1.0.8 nnet_7.3-18 jmv_2.3.4
[9] car_3.1-2 carData_3.0-5 ISLR_1.4 biotools_4.2
[13] mvnormalTest_1.0.0 rstatix_0.7.2 MASS_7.3-58.2 gplots_3.1.3
[17] agricolae_1.3-5 openair_2.16-0 lubridate_1.9.2 forcats_1.0.0
[21] stringr_1.5.0 dplyr_1.1.0 purrr_1.0.1 readr_2.1.4
[25] tibble_3.2.1 tidyverse_2.0.0 latex2exp_0.9.6 patchwork_1.1.2
[29] ggplot2_3.4.2 Epi_2.47 tidyr_1.3.0 magrittr_2.0.3
[33] knitr_1.42
loaded via a namespace (and not attached):
[1] backports_1.4.1 copula_1.1-2 corrplot_0.92
[4] plyr_1.8.8 splines_4.2.3 pspline_1.0-19
[7] listenv_0.9.0 AlgDesign_1.2.1 equatiomatic_0.3.1
[10] digest_0.6.31 foreach_1.5.2 htmltools_0.5.4
[13] fansi_1.0.3 tzdb_0.3.0 etm_1.1.1
[16] globals_0.16.2 recipes_1.0.5 gower_1.0.1
[19] stabledist_0.7-1 xts_0.13.1 hardhat_1.3.0
[22] timechange_0.2.0 tseries_0.10-54 jpeg_0.1-10
[25] colorspace_2.1-0 ggrepel_0.9.3 haven_2.5.2
[28] xfun_0.37 jsonlite_1.8.4 hexbin_1.28.3
[31] lme4_1.1-31 survival_3.5-3 zoo_1.8-11
[34] iterators_1.0.14 glue_1.6.2 gtable_0.3.3
[37] ipred_0.9-14 questionr_0.7.8 future.apply_1.10.0
[40] quantmod_0.4.22 maps_3.4.1 abind_1.4-5
[43] scales_1.2.1 pscl_1.5.5 mvtnorm_1.1-3
[46] GGally_2.1.2 mvnormtest_0.1-9 miniUI_0.1.1.1
[49] Rcpp_1.0.10 xtable_1.8-4 cmprsk_2.2-11
[52] mapproj_1.2.11 lava_1.7.2.1 prodlim_2023.03.31
[55] stats4_4.2.3 htmlwidgets_1.6.2 RColorBrewer_1.1-3
[58] ellipsis_0.3.2 pkgconfig_2.0.3 reshape_0.8.9
[61] farver_2.1.1 deldir_1.0-6 utf8_1.2.3
[64] caret_6.0-94 reshape2_1.4.4 tidyselect_1.2.0
[67] labeling_0.4.2 rlang_1.1.0 later_1.3.0
[70] munsell_0.5.0 tools_4.2.3 cli_3.6.1
[73] jmvcore_2.3.19 generics_0.1.3 moments_0.14.1
[76] broom_1.0.4 evaluate_0.20 fastmap_1.1.0
[79] yaml_2.3.7 ModelMetrics_1.2.2.2 caTools_1.18.2
[82] future_1.32.0 nlme_3.1-162 mime_0.12
[85] compiler_4.2.3 rstudioapi_0.14 curl_5.0.0
[88] png_0.1-8 ggsignif_0.6.4 klaR_1.7-2
[91] pcaPP_2.0-3 stringi_1.7.12 gsl_2.1-8
[94] highr_0.10 stargazer_5.2.3 lattice_0.20-45
[97] Matrix_1.5-3 nloptr_2.0.3 urca_1.3-3
[100] vctrs_0.6.2 pillar_1.9.0 lifecycle_1.0.3
[103] ADGofTest_0.3 lmtest_0.9-40 combinat_0.0-8
[106] data.table_1.14.8 bitops_1.0-7 httpuv_1.6.9
[109] R6_2.5.1 latticeExtra_0.6-30 promises_1.2.0.1
[112] KernSmooth_2.23-20 parallelly_1.35.0 codetools_0.2-19
[115] boot_1.3-28.1 gtools_3.9.4 withr_2.5.0
[118] nortest_1.0-4 fracdiff_1.5-2 mgcv_1.8-42
[121] parallel_4.2.3 hms_1.1.3 quadprog_1.5-8
[124] rpart_4.1.19 grid_4.2.3 labelled_2.11.0
[127] timeDate_4022.108 class_7.3-21 minqa_1.2.5
[130] rmarkdown_2.21 TTR_0.24.3 ggpubr_0.6.0
[133] pROC_1.18.0 numDeriv_2016.8-1.1 shiny_1.7.4
[136] base64enc_0.1-3 interp_1.1-4